Foreword


It's more than half a century since the structure of DNA was discovered, and 
almost two decades since we first sequenced the human genome. Yet, it 
is only now that genetics is truly coming of age. This is because genetics 
was always about the differences between individuals, but only recently has 
it become technically feasible — and affordable — to systematically and 
comprehensively study the differences in DNA sequence between individuals.
This is happening both in large population-scale genetic studies and
commercially, through what is known as direct-to-consumer testing.

The key fact is that many of us now know the content of our genomes. 
This is in strong contrast to the time when the first human genome was 
drafted, around the year 2000. At that time, we knew the contents of only 
one single genome. This was presented as the map to understanding
our biology and health. In many ways, it has become that. However, it is
also not surprising that real understanding came as a result not of that one 
genome, but of the comparison of thousands and millions of genomes, of 
people who differ from each other in many important ways. After all, such 
comparison was the only way of learning to decode an entirely unknown 
language written with only four letters: the A, C, G, and T nucleotides of 
the DNA alphabet. 

We now know the function of many of the “words” and “sentences” 
in our genome, written in these four letters. We know some that affect 
our hair colour and some that affect our disease risk, and in some cases, 
we can begin to make reasonable predictions about such traits, just from 
reading the genome of a person. What we can predict, and how well we 
can do it, however, is bound to expand in the years to come. 

But genetic prediction also has limitations. Your DNA code, written 
already at the time of your conception, does not lock you into a predes- 
tined path of life with a set expiry date. Such determinism would not be 
compatible with how dynamic life is as we know it: not the life of an
insect through its short time on earth, nor the life of a human being in our 
complex modern society. DNA only provides the underlying blueprint on 
which your life will play out. So, what can we use it for? 

Researchers around the world are preoccupied with exactly this ques- 
tion — how can we use the rapidly expanding knowledge on our genomes? 
We currently use genetics for a wide range of applications, many of which 
go beyond mere prediction. In my own research, for example, we seek to 
utilise knowledge of genetic variants associated with diseases to create an
in-silico test that can help to develop new drugs or repurpose old ones. This
test follows from the hypothesis that the molecular effects of an effective 
drug should in some way reverse the changes in genome function from a 
diseased state back to a healthy state. In this way, our detailed mapping 
of the genome may ultimately turn into improved treatments that do not 
even require reading the DNA of the individual patient. 

But such research is just one example of the infinite number of creative 
ways in which our genetic knowledge can be utilised. It is likely that we 
will see an explosion of new applications in the next few years. We can, 
for example, imagine a future where our health status and vulnerabilities 
are routinely interpreted in light of our genetic makeup. We see this hap- 
pening already for some specific cases where tangible benefits have been 
identified. These include more efficient screening programmes for rare 
congenital diseases, as well as improved treatment selection in precision 
medicine for cancer. The list is certain to grow in the coming years. 

A main challenge in such a future is that few genomic findings will be 
straightforward to interpret. If a disease is caused by mutations at a single 
gene, then it is also likely to be very rare, because of the negative impact 
most such diseases have on survival and reproduction. Common diseases 
are common because they usually result from an accumulation of many 
common genetic variants and environmental factors, such that the effect 
of any single genetic variant is small and depends on the environment as 
well as variants present in the rest of the genome. This underscores the fact 
that much of the genetic knowledge we gather is of a non-deterministic 
nature and needs to be interpreted in context. This is a main challenge 
of genetics today, not only in healthcare systems but also particularly in 
consumer genetics settings. 

Thus, our genomes do affect our lives, but they typically do so in
non-deterministic ways: modulating our behaviour and biology, making us 
somewhat more sensitive to certain environmental stresses, and somewhat 
more at risk of developing certain diseases. People who do not understand 
this may adopt a fatalistic attitude to their health, and indeed prefer not 
to find out the content of their genomes. It would be a shame if, because
of this, opportunities for improving health, preventing disease, or receiving
more effective treatments were missed. On the other hand, there is also the
danger that the power and benefits of genetic knowledge are overhyped,
particularly by commercial providers of direct-to-consumer services. This 
could lead to unnecessary anxiety, as well as treatment recommendations 
that are unwarranted or even harmful. For these reasons, it is essential that 
the general public has access to easy-to-understand educational resources
that accurately describe how our genomes impact on our lives, and how 
this knowledge can be used for our benefit. 

The book “Understand Your DNA: A Guide” provides exactly such a 
resource. It is a handbook for people who wish to look into their own genome
and gain an improved understanding, both of the benefits that can be
gained from doing so and of the limitations on the extent to which our
genes dictate our lives. Dr. Lasse Folkersen has done a wonderful job in 
explaining some rather complicated concepts in a way that is not only 
simple, but also interesting. 


Professor Pak Chung Sham 

Suen Chi-Sun Professor in Clinical Science 

Director, Centre For Genomic Sciences 

Co-director, State Key Laboratory of Brain and Cognitive Sciences 
The University of Hong Kong 


My Daughter's Hair Colour


Some children look like their mother, some like their father, some are 
hard to tell. My daughter definitely looks like me. Our hair colours are 
almost identical, and face shape is similar as well. At least she got her 
eyes from her mother. I can’t stop marvelling at this. I’m sure I’m not
the only father (or mother) thinking about these matters, maybe even 
thinking a little too much. But what is a little remarkable is that while
my hair colour is an average Scandinavian light brown, her mother’s hair
is a complete Chinese jet-black. And this is very unlike the
genetics you learn in high school. Here you may have heard that dark 
hair is a dominant trait. That dark plus light gives dark, always. So clearly 
an exception. Not too uncommon though; this happens all the time. It 
could have been all there was to it. But because of my deep interest in
genetics, the exception instead set me on a long path of analysis,
ultimately resulting in this text. 

It was puzzling. Sometimes annoying to the point where people would 
ask my wife “where the mother is”, that is until the two of them started 
chatting in fluent Mandarin. A puzzle, but potentially a solvable puzzle. So,
I set out to get DNA samples from my daughter and all her living ancestors,
ten people altogether, including myself and my wife (Fig. 1.1). I had
decided to use 23andme because back then, in 2012, it really was the most
well-known provider. Plus, they offered a toddler sampling kit. Spit
collection from a two-year-old can be remarkably difficult, in spite of
what you may think from the drooling. Collecting spit from elderly Chinese
in-laws can take some explaining as well, but they are used to me doing
odd things.

[Figure 1.1: pedigree chart. Legible labels include "Oldemor", "Oldefar", "Farmor", "Faster" and "My daughter".]


Fig. 1.1. Proband and pedigree of participants in the example study. Proband is a word 
that basically means “the one the genetics study starts from”, so in this case my daughter 
and her hair colour. For privacy reasons, I have not given actual names, but instead used
local-language family labels. Their meaning should be clear from their position in the
pedigree, but they are not important to know anyway, except “Far”, who is me, and my
daughter, who is indicated. Per pedigree convention, females are shown as circles and
males as squares. A crossed-out symbol refers to a deceased individual.


Genetic testing of your entire family is of course something that needs 
to be well planned and well explained. First, I made an agreement with
everyone in the family that I would use the genetic data only for non-medical
investigations, with the exception that later medical analysis was possible 
but would require a thorough talk and mutual agreement. 

Secondly, genetic testing is inevitably also a test of parenthood:
if the father is not who you think he is, you find out from genetics. I had
decided that if I suspected this, it would be a major no-go. For the sake of
the family, better to just cancel the project than shake up marriages. From
large-scale studies that I have been involved in, we know that this is very
real and that 3% to 5% of fathers are not the actual biological parent. Of
course, the real challenge is to figure out if this may be a problem before
collecting the samples. The way I did it was to have a serious private talk
with all the mothers involved (my own mother included), clearly saying that
one of the things we would find out was the true paternity. Then pausing a
little and changing the subject to something more medical, about whether they
were afraid of finding out about diseases. I thought that way I could give
people an easy way out. And no one took it; indeed, all fathers turned
out to be biological fathers as well. But it is worth thinking about if you
consider doing a similar investigation.

After sampling the spit and sending it off, it usually takes several weeks 
before the data arrives. Once that happens, the first thing you get is a page 
on a website with information about specific traits. This is true no matter 
what company is used. Naturally, I zoomed in on the hair colour trait as the
first thing, and found little new information: my side of the family was
predicted to be of light hair colour and my wife's side was predicted to be
of dark hair colour. My daughter was predicted as somewhere in between.
However, digging a bit deeper exposed a problem with many
analytics interfaces: they are based on old research, incomplete findings and
simplistic assumptions. In this case hair colour was calculated as a function
of just two SNPs. A SNP, pronounced “snip”, is the functional genetic variant,
and we will cover what it is in much more detail later. However,
basing something as complex as hair colour on only two SNPs was clearly not
in line with the latest research on multi-genic traits. Overall, a good rule of
thumb in the world of genetics self-analysis is this: be suspicious of the 
sources. Newer research is usually more comprehensive, and common 
traits are almost never decided by just one SNP. 

Fortunately, deeper digging can remedy this. Almost all companies,
23andme included, allow you to download your raw data. Your raw data
file is a large text or Excel file that contains one line for each SNP,
typically around a million lines. Not something that can readily be processed
manually, but very useful to get deeper information from. In the case of
hair colour, it turned out that a later study (Eriksson et al., 2010) had found
more than 22 SNPs that determined hair colour, on both the blond-to-black
scale and on the red-to-not-red scale. These SNPs would be a place for a
correct answer to be found. So I started to find the genotype information
for each of these 22 SNPs in the data from each of my ten participants,
including my daughter.
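
As a minimal sketch of that lookup, the Python below reads a raw data file
and pulls out the genotypes for a short list of SNPs. It assumes a
tab-separated layout with the columns rsid, chromosome, position and
genotype, and comment lines starting with "#", which is how 23andme raw
files are commonly laid out; other companies use other layouts. The function
name, the sample lines and the second SNP name are made up for illustration.

```python
import csv
import io

def read_genotypes(handle, wanted_rsids):
    """Return {rsid: genotype} for the SNPs listed in wanted_rsids."""
    wanted = set(wanted_rsids)
    found = {}
    # Skip comment lines, then split the remaining lines on tabs.
    rows = csv.reader((line for line in handle if not line.startswith("#")),
                      delimiter="\t")
    for row in rows:
        if len(row) < 4:
            continue  # ignore blank or malformed lines
        rsid, chromosome, position, genotype = row[:4]
        if rsid in wanted:
            found[rsid] = genotype
    return found

# Illustrative stand-in for a real raw data file (assumed layout).
example_file = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs16891982\t5\t33951943\tCG\n"
    "rs0000001\t1\t12345\tAA\n"   # hypothetical SNP, not a real entry
)
print(read_genotypes(example_file, ["rs16891982"]))
```

With a real download, the `io.StringIO` stand-in would be replaced by an
opened file, and the list of wanted SNPs by the hair colour SNPs of interest.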

The most decisive effect was a SNP called “rs16891982”, which is
known to have a very strong effect on hair colour and which completely
corresponded to ethnicity. All the Chinese family members had C/C and
all the European family members had G/G. This is generally true for that
SNP, not just in my family. The C is only found in people of Asian ethnicity.
My daughter, of course, had C/G. This is because of the way
inheritance works: everyone receives one copy from their mom and
one copy from their dad. I could only give G and my wife could only give
C, so C/G was the only possible result for our daughter. We will discuss
these inheritance patterns in more detail in Chapter 4. C/G is also the 
rs16891982 genotype of all other children with one Asian parent and one 
European parent, so clearly it is not the explanation for an unusually light 
hair colour in my daughter. 
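
The one-copy-from-each-parent rule can be sketched in a few lines of Python.
This is only an illustration of the rule as described above; the function
name and the way genotypes are written are choices made for this example.

```python
from itertools import product

def possible_child_genotypes(mother, father):
    """List the genotypes a child can inherit at one SNP.

    Each parent passes on one of their two alleles, so every pairing of
    one maternal and one paternal allele is a possible outcome.
    """
    outcomes = set()
    for from_mother, from_father in product(mother, father):
        # Sort the two letters so that "G/C" and "C/G" count as the same.
        outcomes.add("/".join(sorted((from_mother, from_father))))
    return sorted(outcomes)

print(possible_child_genotypes(("C", "C"), ("G", "G")))  # ['C/G']
print(possible_child_genotypes(("C", "G"), ("C", "G")))  # ['C/C', 'C/G', 'G/G']
```

The first call is the situation described above: a C/C mother and a G/G
father can only have C/G children at this SNP.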

But what about the 21 other hair colour associated SNPs that were 
known? These could contain clues. It turned out that the European side 
of the family, including myself, had a fairly average mix of genotypes in 
these SNPs. Some were of the lighter variant, some of the darker variant.
This makes sense. We are not platinum blondes; we just have fairly average
brown Scandinavian hair. But when I extracted these genotypes from my
wife and the Asian side of the family, I found that remarkably many of them
were of the lighter variant. My wife's family seemed to be genetic blondes, 
so to speak. Looking at a family photo, this is surprising. They all have jet- 
black hair. It illustrates the concept that we will discuss later, the difference 
between phenotype — how you look — and genotype — how your DNA 
looks. So, even with the 21 other hair colour determining SNPs being fairly 
light variants, it makes sense that the rs16891982-Asian-hair-colour-SNP 
trumps them and creates a completely black hair phenotype. This is why 
most Chinese have black hair. 

This trumping effect — or dominance — however was not present 
in my rs16891982-C/G daughter. And having received an otherwise very 
blonde genetic “palette” from her mom, as well as a medium blonde palette 
from me — this was the explanation for our mystery. My wife — a genetic 
blonde — would, together with me, have a much higher chance of having 
light-haired children than what is average for European-Asian couples. 

As a good scientist, I am currently arguing for us to repeat the
experiment a few times, but that is a family matter that is not agreed
yet (so far we have only one child), and so, until further replication,
I consider this the answer to the originally posed question. My wife is a
genetic blonde.

This also sets the aim of this book: to cover everything that you can do
with a personal genome. Your personal genome is the information on the
As, Gs, Cs and Ts that make up your DNA code, your book of life. A short
introduction to this is given in Chapter 2, a basic introduction to molecular
biology. The data for a personal genome typically comes from genetic
tests sold by such companies as 23andme, Ancestry.com, MyHeritage, or 
Family Tree DNA, all collectively called direct-to-consumer (DTC) genetics 
testing companies. But it may also have come from research projects, e.g. 
Genes for Good, or through clinical testing in a hospital. Chapter 3 will give 
more detail on the current approaches and methodologies for how this is 
done. Most consumer companies provide specific analytics interfaces, most
with specific scopes and aims, for example ancestry. Adding further analysis
will enable you to learn more. This is the reason for this book: to provide
a handbook on all types of personal genome analytics. Specific analysis types
can be grouped into ancestry studies, rare disease genetics, and common
disease genetics. These themes are introduced in Chapters 4, 5 and 6
respectively, conveniently sorted by how much mathematics each requires
to understand.

Ultimately, this book is in part a pitch to encourage you
to educate yourself in personal genetics. We can learn many things
from our genes: deep questions about our ancestry, curiosities about hair
colours, but also medically useful insights. But it is also an emphasis on
the limitations of interpretation; our genes are not a deterministic crystal
ball that has the complete answer to everything. To gain any form of benefit
or insight from genetics, it is most of all important to understand this gap
between no interpretation and over-interpretation.


Basic Introduction to Genetics


The human genome consists of more than three billion DNA letters: 
A, C, T or G. They are, of course, not real letters, but four different kinds 
of molecules, nucleotides, that we have named A, C, T and G. Strung
together, they encode all of our genes, and can very well be thought of as
letters in a book, three billion of them in length. It is these genes
that make all the proteins, which in turn make the things that we consist
of. So, it is very true when people describe DNA as the blueprint for
an organism, or the book of life. Everything that we consist of can ultimately
be traced back to information given in our DNA. 

Many people have heard about the sequencing of the human genome 
around the turn of the century, and the immense efforts involved in figuring 
out the more than three billion letters of our DNA. After the efforts of the 
first human genome project, scientists started to ask what the DNA of all 
the other people in the world looked like. Thousands of genomes from 
people all over the world were DNA-sequenced, in a project named the 
Thousand Genomes Project. And it turned out that, by and large, humans 
were fairly similar genetically. In fact, at 99.9% of the DNA positions, almost 
all humans had the exact same letter, the same nucleotide. 


2.1 SNPs, Genomes, and Chromosomes 


To understand better what this means exactly, let us zoom in on a specific 
place in the genome. We will zoom in on a small region containing a SNP 
called “rs16891982”, the one we discussed in Chapter 1. The concepts
are general for any region of the genome, though. 

Human 1: Asian ancestry (← 33M letters | 147M letters →)
AGGAAAACACGGAGTTGATGCA C AAGCCCCAACATTCAACCTCGA
AGGAAAACACGGAGTTGATGCA C AAGCCCCAACATCCAACCTCGA

Human 2: Mixed ancestry (← 33M letters | 147M letters →)
AGGAAAACACGGAGTTGATGCA G AAGCCCCAACATCCAACCTCGA
AGGAAAACACGGAGTTGATGCA C AAGCCCCAACATCCAACCTCGA

Human 3: European ancestry (← 33M letters | 147M letters →)
AGGAAAACACGGAGTTGATGCA G AAGCCCCAACATCCAACCTCGA
AGGAAAACACGGAGTTGATGCA G AAGCCCCAACATCCAACCTCGA

Gorilla
AGGAAAACACGGAGTTGATGCA C AAGCCCCAACATCCAACCTCGA
AGGAAAACACGGAGTTGATGCA C AAGCCCCAACATCCAACCTCGA

Dog
AGGAAAACACGGAATTGATGCA G AAGCCCCAACATCCAACCTCGA
AGGAAAACACGGAATTGATGCA G AAGCCCCAACATCCAACCTCGA


Fig. 2.1. Sequence of 45 nucleotides on chromosome 5 in five organisms, three of 
whom are human. This region is part of the SLC45A2 gene. The center part highlights the 
rs16891982-SNP. 


In this particular region, the nucleotide code is as shown in Fig. 2.1:
long stretches of A, T, C and G. It continues like that in each direction,
for millions of letters. The central highlighted letter, the one that is not
identical in all of the genome sequences, is a SNP: one of the positions
of known variability in humans. SNP is an acronym for Single Nucleotide
Polymorphism and it is pronounced “snip”. The only reason we call this
position a SNP is that it is not always identical; different humans have
different letters at this position. Because of projects like the Thousand
Genomes Project, virtually all common SNPs in the human genome are known.

For a position to be called a SNP, that position must have been 
observed to have different letters in at least some individuals. SNPs are
recognisable and distinguishable because they all have names beginning 
with rs plus a number, for example rs16891982. We have named them like 
that because we wanted to categorise them and keep track of them — 
which SNPs are associated with which traits. 

Together with the SNP name, a chromosome number and a position
is often given. The region in Fig. 2.1 is located at a position given as 
letter number 33,951,943 on chromosome 5. Chromosomes are a level 
of organisation of the genome, where all the DNA sequence is physically 
divided into 22 ordinary chromosomes, as well as X and Y chromosomes 
that determine the sex. They can be loosely thought of as volumes in a 
book, but other than the sex chromosomes, there is nothing much special 
about a location on a specific chromosome. Most traits are scattered around 
chromosomes anyway. 

A SNP may be located in a gene or next to one. A gene is a func- 
tional region of the DNA that is read and translated into the constituents 
of the body: proteins and the products that they in turn make. How that 
happens is the subject of the entire field of molecular biology. Like
chromosomes, however, genes actually play only a secondary role when
analysing your own genome, because the functional unit that is measured
is the SNP. SNPs are important. But now you also know the terms gene
and chromosome, and they are nice to know. 


2.2 The Alleles that Make the Genotype of a SNP


The next thing you'll notice in Fig. 2.1 is that each individual has two 
sequences of DNA. This is a consequence of sex. We all inherit one copy 
of our genome from our mother and one copy from our father. Of course, 
your mother and your father also had two copies each, so there's a bit of 
random shuffling in which version they'll pass on. Basically, this is the choice 
of which egg cell and which sperm cell is involved in conception. Quite an 
important moment in anyone's life. The specific combination of letters that 
you have for a given SNP is what we call your genotype. 

The word allele is the word we use to describe what letters the gen- 
otype consists of. It just means a letter, in the context of another possible 
letter. So, you can say for example “the C-allele” to distinguish from having 
a G-allele at that genomic position. But it will sound odd if you say “C-allele”
at a position that always has C in all humans. I will use allele throughout
the book, but if you quietly translate it as genome letter, that’s fine as well.

At each SNP you have a genotype, and that genotype consists of two
alleles. These are important because they are what make you unique. Take,
for example, counting dark-hair letters at the rs16891982 SNP. Here, each
C is known to give darker hair, whereas each G does the
opposite. So, in many cases we can just count the number of Cs to get a
conclusion; i.e. two Cs would on average result in hair twice as dark as
just one C.

Counting alleles like this is called additive genetic inheritance (Fig. 2.2).
There are other types of inheritance, called dominant and recessive genetic
inheritance. That's when it only matters if your genotype has one G, and
having two Gs doesn’t add more effect (dominant), or if you absolutely
need two Gs to observe any effect (recessive). Inheritance is the subject
of Chapter 5.


Fig. 2.2. Illustration of additive genetic inheritance: A blue plain-pattern shirt and a black
dotted shirt combine to produce a blue dotted shirt. Note that hair colour does not follow
additive inheritance, as discussed in Chapter 1.

Finally, we can be either heterozygous at a SNP (the letters are dif- 
ferent, e.g. C/G) or homozygous (the letters are not different, e.g. C/C or 
G/G). This is mostly a nice-to-know term, since we may as well just write
that we are C/G instead of using a technical term like heterozygous. As long
as you remember SNP, genotype and allele, you can understand the rest of
this book.


2.3 DNA as a Universal Code of Life 


There are more interesting details to Fig. 2.1. Notice the two last DNA- 
sequence pairs in the figure. They are not human. One is from a dog and 
another is from a gorilla. The DNA code is used throughout the living 
world: animals, plants, even microscopic bacteria. The sequence is dif- 
ferent, though. The more different two organisms are, the more different 
their DNA code is. Gorillas and dogs are fairly closely related to humans. 
We are all mammals after all, and the gorilla is even in the same order as 
us, the primates. So, it is not surprising that stretches of code are similar; 
particularly since this is a hair colour region, and all three species have hair. 

As you can see, the gorilla is virtually identical, at least to the Asian
genome; it seems like this particular combination of nucleotide letters
really is strongly associated with dark hair. This region of the dog genome,
however, is more similar to the European genomes, perhaps because
dogs have a larger palette of hair colours. Elsewhere in the genomes of the dog
and the gorilla things look very different from us and we wouldn't even be 
able to compare (“align”) our sequences. But, since this particular region 
is used to encode a hair-related protein, it makes sense that we all have it. 
The point is that every person and every organism looks different — and 
there's a reason for it; the reason is DNA. 

Note one more thing: another nucleotide letter in the dog is also 
different. The ninth position leftwards from the hair colour SNP is always 
an A rather than a G in dogs. Maybe it's because some other quality of
dog hair is different. That cannot be known for certain from the nucleotide
change alone. But such differences are the basis for calculating things such
as “Gorillas are 98% identical to humans” and “dogs are 84%”. Basically, this
is just counting how many nucleotides are different on average. So, overall
the code of life, DNA, is a universal thing that underpins how we look and 
are in the most basic sense of the word. 
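
The counting behind such percentages can be sketched as follows, using the
45-letter region from Fig. 2.1. This is a toy calculation on already aligned
sequences; the function name is made up for this example, and the figures
quoted above come from averaging over vastly more positions than this.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions with the same letter in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# The 45-letter region of Fig. 2.1 (one copy per organism, spaces removed).
human_european = "AGGAAAACACGGAGTTGATGCA" "G" "AAGCCCCAACATCCAACCTCGA"
gorilla        = "AGGAAAACACGGAGTTGATGCA" "C" "AAGCCCCAACATCCAACCTCGA"
dog            = "AGGAAAACACGGAATTGATGCA" "G" "AAGCCCCAACATCCAACCTCGA"

print(round(percent_identity(human_european, gorilla), 1))  # 97.8
print(round(percent_identity(human_european, dog), 1))      # 97.8
```

Both comparisons differ at exactly one of the 45 positions, so this short
and highly conserved stretch says little about the genome-wide averages.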


2.4 Different Measurement Methods 


Today we can measure all these things, efficiently, scalably and relatively 
cheaply. When analysing your own genome, it is important to understand 
some details about how we measure it. For example, there is a difference
between the full sequence of DNA and what you get from most
direct-to-consumer companies. That's because there is usually no reason to measure
everything, when we can just measure the 0.1% that we know are variable 
between humans. That is much cheaper and, assuming that the remaining 
99.9% is identical in all humans, it is even correct. We will talk more about 
that assumption later, but except for rare genetic variants it is somewhat 
correct. 

The technology to do this is called a microarray and it is different from
DNA sequencing in that it only measures positions of known variability. 
Microarrays are still the workhorse of large-scale population genetic stud- 
ies. They are designed to measure all the known common variation, and 
they are the product that (almost) all consumer genetics companies sell. 
The technical details of the two methods will be discussed in Chapter 3. 


2.5 Rare Genomic Variation 


It often matters a lot if a SNP is common or rare. A SNP is common if differ- 
ent alleles are found even when just looking in a few people. But humans 
can also have non-common SNPs that are seen rarely or only in very specific 
groups of people. An example is Human 1 in Fig. 2.1. Here, the thirteenth 
position to the right has one T instead of a C. This is uncommon, but not 
unheard of. In a group of a thousand people, only one or two would have 
this T-variation. It has a name, rs143764115, so it has been observed before, 
by some scientists who registered it (Lek et al., 2016). 

Any single position in the human genome can theoretically contain a 
different nucleotide letter. At most positions, it's just highly unlikely, which 
we know because of projects such as the Thousand Genomes Project.
The terms common and rare SNP are often used to describe this frequency.
Frequency thus refers to the chance of having the rarer allele. 
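
As a small worked example of frequency, the Python sketch below counts
alleles in a made-up group of a thousand people. The sample and the function
name are invented for illustration. Note that the calculation counts letters
rather than people: each person contributes two alleles.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Frequency of each allele among a list of genotypes like "C/T"."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype.split("/"))  # two alleles per person
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Made-up sample: 1,000 people, two of whom carry one copy of the T allele.
sample = ["C/C"] * 998 + ["C/T"] * 2
frequencies = allele_frequencies(sample)
minor_allele = min(frequencies, key=frequencies.get)
print(minor_allele, frequencies[minor_allele])  # T 0.001
```

Here two people in a thousand carry the T, but only 2 of the 2,000 letters
are T, so the frequency of the rarer allele is 0.001.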

For most of the rare SNPs we have no clue what they do to us. Because 
they are so rare we have never or only rarely observed people carrying 
them. Presumably, some of these theoretical mutational changes are so 
fundamentally disrupting to life that a child could never be conceived, let 
alone born, with it; we would never observe those. Others are probably 
completely unimportant. Scientists are busy at work mapping and exploring 
all this, and in the coming years the part of the genome that is understood 
will grow and grow. 

This outlines the level of our current knowledge: many things are 
common, known, and understood — like the rs16891982 hair colour SNP 
and the general concept of hair colour being a function of this one and 
at least 21 other hair colour associated regions. But there also exist rare 
SNPs, and SNPs that are poorly understood. 


2.6 Finding Functions of SNPs in Families 


The methods that we use for deciphering the genome are many and diverse. 
Understanding some of the main ones can be useful in understanding your 
own genome analysis. The most basic investigation, and the one that was 
first used, already decades ago, is the family study. In the family study, we 
map a family with some type of disease. 

Mapping here means constructing pedigrees like the one shown in
Fig. 1.1, and noting who has the disease trait and who does not. Then a
genome measurement (DNA sequencing, microarray, etc.) is performed,
and it is calculated whether any of the genetic variation measured goes
with (“segregates” with) having the disease trait. Simple in its setup.
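
The idea of a variant going with a trait can be sketched as a simple check
over a family. The family, genotypes and function name below are
hypothetical, and the all-or-nothing rule is a deliberate simplification:
real linkage studies use statistical tests rather than a perfect match.

```python
def segregates_with_trait(family, allele):
    """True if exactly the affected family members carry the allele."""
    return all((allele in genotype.split("/")) == affected
               for genotype, affected in family.values())

# Hypothetical family: (genotype at one SNP, has the trait?)
family = {
    "grandmother": ("A/T", True),
    "grandfather": ("A/A", False),
    "mother":      ("A/T", True),
    "father":      ("A/A", False),
    "child":       ("A/A", False),
}
print(segregates_with_trait(family, "T"))  # True
print(segregates_with_trait(family, "A"))  # False
```

In this made-up family the T allele is carried by the two affected members
and nobody else, whereas the A allele is carried by everyone and so tells us
nothing about the trait.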

The family study (or linkage study) has a particularly good track record 
in finding explanations for rare genetic disease. The disease trait could in 
theory also have been any other common trait, such as freckles or hair
colour, or a common disease, like stroke. But the linkage study type was
never very successful for those. We think it's because such common and
complex traits are harder to track in families. When they are successful,
however, the findings from family studies are usually quite strong: as in,
if you have a given disease allele, then you are almost sure to have the
disease. Such variants are called high effect size SNPs. Family studies find
few high effect size SNPs that are rare. They are the focus of Chapter 5.


2.7 Finding Functions of Common SNPs in Populations


Common complex traits like stroke or hair colour have, since the mid-2000s, been
investigated with another study type: the genome-wide association study 
or “GWAS” (Klein et al., 2005). This study contrasts with linkage studies in 
that there is no family mapping. Instead, there are simply two large groups 
of unrelated individuals. One group with a trait, e.g. a disease, and one 
group without it: the cases and the controls. In each group, the genetics 
is measured — almost always using genotyping microarrays — although 
there is no reason why it could not be DNA sequencing. At every single 
position, we then ask if an allele is more common in the cases compared 
to the controls. That's it. 

What the GWAS have generally found is that each complex trait, 
disease or otherwise, has tens to hundreds of different common SNPs 
associated with it. In contrast to family studies, though, they are not 
high effect, merely statistical observations. A typical finding is that a SNP 
is associated with a 1.1- to 1.3-fold increase in risk, expressed as what is 
known as an odds ratio. This metric will be further explained later (Chapter 5). 
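
As a rough sketch of the case-control comparison at a single SNP, the odds ratio can be computed directly from allele counts. The counts below are invented for illustration, and the function name `allelic_odds_ratio` is my own; a real GWAS would also attach a statistical test and correct for testing millions of positions.

```python
def allelic_odds_ratio(case_alt, case_ref, control_alt, control_ref):
    """Odds of carrying the alternative allele in cases, divided by the same odds in controls."""
    return (case_alt / case_ref) / (control_alt / control_ref)

# Hypothetical allele counts: 1000 cases and 1000 controls, two alleles per person.
case_alt, case_ref = 460, 1540
control_alt, control_ref = 400, 1600

odds_ratio = allelic_odds_ratio(case_alt, case_ref, control_alt, control_ref)
print(f"odds ratio = {odds_ratio:.2f}")  # prints "odds ratio = 1.19"
```

An allele seen in 23% of case chromosomes versus 20% of control chromosomes lands squarely in the typical 1.1 to 1.3 range.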

The point is that individually the SNPs found by GWAS won't have a 
very large effect on your life. Only when many SNPs are added together 
will they have any appreciable effect at all. Studies of the GWAS type find 
many low effect size SNPs that are common and scattered over many dif- 
ferent genes. This is the focus of Chapter 6. 
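
To get a feeling for what "added together" means, here is a minimal sketch that combines several small effects into one number. It assumes, as a simplification, that the SNPs act independently and multiplicatively, so that odds ratios can be multiplied once per risk allele carried. The odds ratios and allele counts are made up, and the result is relative to a hypothetical person carrying none of the risk alleles. Chapter 6 discusses how such scores are built in practice.

```python
import math

def combined_odds_ratio(per_allele_odds_ratios, risk_allele_counts):
    """Multiply per-allele odds ratios, once for each risk allele carried (0, 1 or 2 per SNP)."""
    log_odds = sum(
        count * math.log(odds_ratio)
        for odds_ratio, count in zip(per_allele_odds_ratios, risk_allele_counts)
    )
    return math.exp(log_odds)

# Hypothetical example: five SNPs, each with a small effect.
odds_ratios = [1.1, 1.2, 1.3, 1.1, 1.2]
risk_alleles_carried = [2, 1, 0, 1, 2]

combined = combined_odds_ratio(odds_ratios, risk_alleles_carried)
print(f"combined odds ratio = {combined:.2f}")
```

No single SNP here exceeds 1.3, yet six risk alleles together give a combined odds ratio of about 2.3.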


2.8 Missing Heritability and Future Methods 
for Variant Discovery 


Even with these two main methods, we know there is still a lot of unknown 
genetics. Interestingly, this is not just a guess. We know it for sure, 
because, for any given trait, it is possible to calculate how much of the trait 
variation is dictated by genetics. This is often done using studies of twins. 

If you know an identical twin pair you may have noticed that some of 
their traits seem to be particularly similar (appearance, hair colour, height). 
Other traits may be as dissimilar as in any pair of siblings (temper, taste 
in clothes, personality). Because identical twins have virtually identical 
genomes, we can use these similarities to quantify how much of a trait is 
explained by genetics. This we call the heritability of a trait. 

If a trait is always found to co-occur (or be absent) in pairs of identical 
twins (monozygotic twins), but it is randomly distributed in pairs of fraternal 
twins (dizygotic), then we know this trait has a heritability close to 100%. 
The heritability can be quantified by counting this co-occurrence, and that 
has been done for many traits. 
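
One classical way of turning twin similarity into a number is Falconer's formula, which estimates heritability as twice the difference between the trait correlation in identical twins and in fraternal twins. I use it here as an illustration of the principle; it is not necessarily the method behind the numbers quoted in this chapter, and the correlations below are invented.

```python
def falconer_heritability(r_identical, r_fraternal):
    """Falconer's estimate: h2 = 2 * (r_MZ - r_DZ), clamped to the range 0 to 1."""
    h2 = 2.0 * (r_identical - r_fraternal)
    return min(max(h2, 0.0), 1.0)

# Hypothetical twin correlations for a height-like trait.
h2_example = falconer_heritability(r_identical=0.9, r_fraternal=0.5)
print(f"estimated heritability = {h2_example:.0%}")

# A trait that is equally similar in both twin types gets an estimate of zero.
h2_shared_environment = falconer_heritability(r_identical=0.4, r_fraternal=0.4)
```

The clamping is a design choice of this sketch: with noisy data the raw formula can fall below 0 or above 1, which has no meaningful interpretation as a percentage.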

This means that the whole nature vs nurture debate is pretty much 
over in genetics — because for any given trait we know the percentage of 
nature and the percentage of “nurture” or environment. Environment, in 
this case, should be thought of more broadly than the usual usage of the 
word. For a geneticist, environment simply means everything that is not 
decided by your genes. You can think of it as the degree of freedom you 
have from your genes. 

Things like infectious diseases are only weakly influenced by genetics. 
They have low heritability, meaning that they mostly depend on what happens 
to you in your environment. Low, but not 0%: there 
is such a thing as natural/genetic protection against many infectious diseases. 
Common complex diseases, such as cardiovascular disease, autoimmune 
disease, and mental disorders such as schizophrenia, are somewhere around 
30% to 60% heritable, and the rare genetic diseases are close to 100% heritable. 
This is shown as the medium and dark blue boxes in Fig. 2.3. 

If genetics explains 50% of the risk of getting a given disease, then 
we should be able to quantify half the overall risk simply by measuring all 
of the genome, right? Yes, in theory. But still wrong in practice. We know 
we cannot, but at least we can quantify how close we are. The part we cannot 
yet account for is called the missing heritability. We know many SNPs that 
dictate some of the risk of a disease (darkest blue bar, Fig. 2.3); we also 
know that there is some of the risk variability left to explain (medium blue 
bar, Fig. 2.3). We cannot guess that accurately, it seems. We will discuss 
this further in Chapter 6 about common diseases, but it is worth 
understanding the limitation, both to what genetics can theoretically do and 
to what it can practically do. Novel analysis methods, larger population 
analyses, and usage of full DNA sequencing are likely to push this number 
closer to the theoretical limit in the future. But, right now, there is a 
fraction of a trait that we know is genetic, and there is a fraction of that 
fraction that we know we can explain. 

Fig. 2.3. Environment and heritability. Known and unknown heritability. The figure shows 
variation in disease risk (or height) as the total length of each bar. The two darker blue bars 
show how much we believe is explained by genetics. The darkest blue bar alone shows 
how much of this comes from specific known SNPs. 
(Data from Fernandez-Pujals et al., 2015; Hyde et al., 2016; Lello et al., 2017; Loh et al., 
2015; Marouli et al., 2017; McPherson & Tybjaerg-Hansen, 2016; Wood et al., 2014) 

We can currently only make genetic predictions according to the 
darkest blue bar, and we will never be able to make genetic predictions 
that explain more than the extent of the dark and medium blue bars 
combined. How much this is depends on the trait. A well-defined and highly 
heritable trait like height has a solid 80% heritability, of which 21-40% is 
known and understood. Other traits have low heritability (like risk of clinical 
depression at 30%), and we know only an almost non-existent fraction of 
that, maybe because it is much harder to define and investigate than e.g. 
height. Everything else is somewhere in between. 
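
The "fraction of a fraction" is easy to work out for the height example. The small sketch below assumes that the 21-40% refers to the share of the heritable part that is explained by known SNPs, which is how the sentence above reads; the function name is my own.

```python
def explained_share_of_trait(heritability, known_fraction_of_heritability):
    """Share of the total trait variation that known SNPs can currently explain."""
    return heritability * known_fraction_of_heritability

# Height figures quoted in the text: 80% heritable, of which 21-40% is known.
low = explained_share_of_trait(0.80, 0.21)
high = explained_share_of_trait(0.80, 0.40)
print(f"known SNPs explain {low:.1%} to {high:.1%} of the variation in height")
```

Under that reading, even for a trait as well studied as height, today's predictions cover roughly a sixth to a third of the total variation between people.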


Methods in Consumer 
Genetics 


Consumer genetics is a field of growth. There literally exist hundreds of 
different companies, and more are opening every day. At the time of writing, 
my first intention was to create an overview of these companies and their 
pros and cons. I now don't believe such a list would be useful past a few 
months after writing. Instead, I'll aim to list and describe major categories 
and concepts in the field, trying to exemplify them with some of the major 
players. Hopefully that will be slightly more future-proof. 

The marketing material of any company will start by highlighting the 
fantastic things you can find out about yourself, your body and your future. 
As we saw in Chapter 2, this is not completely made up, although it is 
often vastly exaggerated. I will call these things the interpretation. It always 
comes in the form of a web site or report or some kind of verdict on your 
health or ethnicity or whichever trait the company is selling. Before the 
interpretation, however, there is a lot going on in generating some 
interpretable data from your spit sample. I will call this step the data generation. 

The data generation relies on a number of technological advancements 
that have happened in recent years, and always ends with a big bulky 
computer-only file of several megabytes or even gigabytes. Later in this 
chapter we will learn how these files are generated. The interpretation relies, 
in any and all cases, on the state of current genetics science, and on the 
ability of the company to keep this updated and well aligned with our current 
understanding. The output of interpretation is plots and conclusions. 

How you use that depends on who you are. Let's go through some 
possible case stories, to see how that may play out. Imagine you are at a 
dinner of a fictional family and they have each bought a direct-to-consumer 
DNA test. Everyone has a different background for understanding advanced 
genetics, so we can use this as a good example of genetic testing at successively 
more complex levels: first for Alice, the grandmother, then Bob the father, 
and finally Carol the daughter, each of them being more of a super-user 
than the previous. 


3.1. Case A — the Basic Level 


The first person who bought a genetic test is the grandmother, “Alice”. 
Alice, although now retired, is an adventurous person who is fairly com- 
fortable with computers. She is not afraid of using the Internet. Just two 
months ago, she used her computer to search for “genetic test” and found 
a website that sold those for $200. After paying, she was told to wait for a 
little test tube to arrive in the mail. It did, and she spat in it several times 
and then closed it firmly before putting it in the return envelope. That was 
all easy and just a matter of following the instructions. 

But now she has received an email saying her results are ready, and she’s 
very excited. What will her experience of consumer genetics be? Alice will 
probably first be met with an ancestry calculation. It'll tell her what 
percentage of her DNA comes from different countries and regions. 
Maybe it'll be as she expected. Maybe she'll discover that some of her 
DNA is from somewhere entirely unexpected. At this point it would be a 
good idea for her to read Chapter 4 of this book, particularly the section 
about country percentages. Ultimately, it will likely just provide for good 
additions to the family saga. 

Another thing that will likely happen is that Alice will start getting 
contacted by people who say they are related to her. This, of course, 
depends a little on which company and what privacy settings she has cho- 
sen, but matching relatives is a central part of many genetics analysis sites. 
In many cases, it may even be a path to a new adventure into genealogy 
and research of who her recent ancestors were. 

Will Alice be interested in any findings from her health data? If the 
company that she chose includes health reports, then maybe she will. But 
hopefully both she and her family members will also take any increased 
genetic risk scores with a grain of salt. Since she is a grandmother and has 
already reached an old age, she obviously was not destined to suffer from 
a terrible genetic disease. So, genetic health data from a 
direct-to-consumer testing company is probably of fairly limited interest to 
her, and there is little point in pursuing that further. She is happy enough 
with that. 

Alice will use the interface of the genetic company that she chose, and 
probably never move on to third-party sites. However, it probably doesn’t 
matter too much if the company is myheritage, 23andme, ancestry.com, 
or another company. Perhaps the ancestry-focused sites are slightly more 
appropriate for her, but at the end of the day I don’t think there are hugely 
compelling reasons to say one is better than the other. I’m sure online 
forums will have plenty of opinions that say differently, but for a casual 
user like Alice it won't matter too much. 


3.2. Case B — Intermediate Level 


In our case study story the next person, “Bob”, also had a genetic test done. 
Bob had some biology in high school, but otherwise no special background 
for understanding a genetic test. But he thinks it is very exciting to try out 
new things. Also, maybe a little scary, particularly the health aspects of it. 
Like his mother, Alice, he has gone through the online search, the payment, 
the tube-spitting and the return envelope. Now, a few months later, he 
receives an email saying that his results are ready. 

At first log-in he reads the same ancestry results as Alice. They are 
even fairly similar, only of course with the genes of his father thrown into 
the mix. The unknown people that contact him and say they are related 
will typically have weaker connections. A 3rd cousin relation for Alice, for 
example, is barely detectable from Bob's DNA. At the family dinner Bob 
and Alice can probably talk a lot about this and compare notes, maybe 
even help each other with their genealogy research. 

For Bob, the father, the big thing he must consider is health reports. 
Does he want to have a look at them? They may have a real impact on 
his life and it definitely will matter how they are presented. Bob is good 
at computers, but we don't expect him to go out downloading raw data 
and visiting complicated third party analytics websites. So, for Bob, the 
choice of genetic testing company does matter a lot, because he’s going 
to use only the interpretation that is given by that company. This probably 
means that for a health-interested person, the tests sold by 23andme are 
recommendable. That's because these are the ones currently approved 
by the health authorities, at least in the USA. This is likely to change 
and expand in the future, so my point is not to recommend a specific 
company, but rather to note that the choice of DNA testing company 
is a very important one, particularly for an intermediate level user who is 
interested in the full package of information given from genetics, including 
health information. 

Bob did make that important choice and he is now browsing through 
pages that tell him about risks and dangers to his own health, all based on 
a genetic test from his own spit. This may be very scary reading, with findings 
of immediate relevance. Maybe he has Alzheimer’s mutations; or maybe 
he is the carrier of early Parkinson’s mutations. Maybe any future unborn 
children are at risk because he is the carrier of a rare recessive mutation. 
For Bob, at this point, it is strongly recommended to re-read Chapter 5 in 
this book, and carefully consider the meaning of “increased risks”. Careful 
and accurate communication of increased risk, if any, is a supremely difficult 
subject. But learning more about the numbers and assumptions behind 
these calculations is going to be a very wise move for him, particularly 
before he goes and tells the rest of the family about his findings. 

All in all, Bob is content with what he found. It hopefully didn’t change 
his world fundamentally or make him unnecessarily anxious. But even if it 
did, hopefully, the findings were ones that gave him a benefit: an important 
medical insight that he could actually use for something. 


3.3. Case C — Advanced Level 


Like Alice and Bob in our fictional-family case study story, the daughter, 
Carol, also had a genetic test done. Carol, however, is tech-savvy and she 
even studied mathematics and science at university. She wants to have a 
genetic test for sure, and when it is done, she absolutely wants to get every 
piece of information that she can possibly get out of it. 

Registration, login, payment and spit sampling are routine for Carol. 
Getting the first result is also easy, and relatively uneventful. Carol has the 
strong feeling that a lot of the information has been dumbed down so 
much that she doesn’t really feel that it is meant for her. She therefore quickly 
proceeds to the section marked “raw data download”. For Carol, it matters 
very little what genetic testing company she chose, because she knows 
that most of them will anyway serve her with a raw data file from what is 
called a microarray. She downloads that file. 

Nonetheless, Carol is overwhelmed when opening that file. Nobody, 
not even professional geneticists, reads their data files raw. It’s just lines of 
As and Cs and Ts and Gs, interspersed with genomic coordinates. Such 
info has to be interpreted. Carol skilfully avoids a few of the most obvious 
interpretation scams, and proceeds to an overview site for third party 
software, such as www.isogg.org or dnatestingchoice.com. There are more 
recommendations scattered around this book. 

I'd like to think that Carol immediately heads to the advanced analytics 
site that I have set up, www.impute.me, but on the other hand, I’d even 
more like this book to be a generally applicable handbook. So, let's just 
say she heads to a very advanced analytics site. Before doing that, Carol 
definitely needs to read up on Chapter 6 and the points about genetic 
risk scores. For all common diseases and traits, multi-SNP calculations are 
becoming the general rule, instead of single-SNP effects. We are simply 
too complex organisms for a single SNP to decide our height, weight, 
intelligence or stroke risk. Carol understands this. 

Having understood that, it would also be of interest to Carol to revisit 
Chapter 2 and the section about missing heritability. This remains one of the 
main challenges of genetics, and even I find it hugely difficult to explain our 
thinking about the interplay between environment and genetics. Luckily for 
me, Carol is a smart woman and she probably understands it well herself 
now. The remaining challenge for Carol is to sit down at her family dinner 
and explain what she found, without alienating Bob and Alice. 


3.4. Data Generation Using Microarrays 


We have now seen how consumer genetics looks from the user perspective. 
But how does it look behind the scenes? When analysing your own 
genome, the first important point to understand is the difference between 
the full sequence of DNA and what you get from most companies. As we 
discussed in Chapter 2, there is usually no reason to measure everything, 
when we can just measure the 0.1% that we know is variable between 
humans. That is much cheaper, currently more scalable, and, assuming that 
the remaining 99.9% is identical in all humans, it is correct. 

The technology to do this is called a microarray and it is different from 
DNA sequencing in that it only measures positions of known variability; 
that is, the genomic locations where we know at least some people have 
different alleles (i.e. SNPs). How it does this is a marvel of miniaturisation, 
where millions of different molecules (“probes”) have been attached to 
strict known positions on a small 2 x 2 cm plate (Fig. 3.1). Microarrays are 
the current workhorse of large-scale population genetic studies, they are 
designed to measure all the known common variation, and they are the 
product that (almost) all consumer genetics companies sell. 

Fig. 3.1. The setup of a microarray. The figure is to scale on a logarithmic scale, meaning 
that 1 cm on the picture equals 1 µm towards the bottom, but 10 m towards the top. 
The molecule-string makes up something called a probe, which is a group of 25-letter-long 
DNA strands that can match with a particular region of human DNA to provide a signal for 
genotype. There are millions of these probes on a microarray. If the probe lights up with 
fluorescence we know that the specific sequence was present. A SNP will have a probe for 
each of its alleles and surrounding sequence. 

You can recognise a microarray from the raw data files. They basically 
consist of hundreds of thousands to millions (700K to 2M) of data lines with 
SNP-ids, positions and two-letter genotypes (A/G, G/G etc.). In other words, a 
mapping of your genome at all the positions where we know humans are 
genetically different. This is the prerequisite for most genetic analysis and 
this technology has today become so cheap that the going price for a microarray 
is less than a hundred dollars. 
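
To show what such a file looks like to a program, here is a minimal sketch of a reader. It assumes a tab-separated layout with the columns SNP-id, chromosome, position and genotype, and comment lines starting with `#`, which is similar to what some providers export; your own file may be laid out differently. The SNP-ids and positions in the sample are invented.

```python
import csv
import io

# Hypothetical excerpt of a raw data file; real files have 700K to 2M rows.
SAMPLE = """# rsid\tchromosome\tposition\tgenotype
rs0000001\t1\t1000\tAG
rs0000002\t1\t2500\tGG
rs0000003\t2\t730\t--
"""

def read_genotypes(handle):
    """Return {snp_id: (chromosome, position, genotype)}, skipping comments and no-calls."""
    genotypes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # blank line or comment/header line
        snp_id, chromosome, position, genotype = row[:4]
        if "-" in genotype:
            continue  # assumed convention: "--" marks a position that could not be measured
        genotypes[snp_id] = (chromosome, int(position), genotype)
    return genotypes

genotypes = read_genotypes(io.StringIO(SAMPLE))
print(genotypes["rs0000001"])
```

For a real file you would replace `io.StringIO(SAMPLE)` with `open("your_file.txt", newline="")`.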


3.5. Data Generation Using DNA Sequencing 


If microarrays are a marvel of miniaturisation, then proper DNA sequencing 
is even more so. As with microarrays, the price has been dropping quickly 
and many predict it will eventually become even cheaper. However, today 
(2018), the price tag is closer to $1000 for a DNA sequence. It is worth 
noting that DNA sequencing as a term is often used a little inconsistently, 
particularly in the world of consumer genetics. In practice, assuming you 
didn't pay thousands of dollars, you probably got a microarray analysis, even 
though it was called DNA sequencing. 

Several recent initiatives may change this, including that of 
Gencove.com, which promises to provide DNA sequencing at even cheaper 
prices than microarray. 

Likewise, the Dante Labs product claims to be pushing sequencing 
prices into the affordable range. Maybe when you read this, the tables 
have turned and sequencing is cheaper and better than microarray. But 
the trade-off is always quality vs price: You can already now have cheaper 
or better than microarray, but not both. So, for now, most consumer 
genomics is done with microarray. I'll leave this space open for future 
updates as the world progresses. 

Anyway, if you have somehow obtained DNA sequencing, you will 
need to have that analysed as well. The rawest output of DNA sequencing 
comes in something called a .fastq file: an extremely large file with just 
rows and rows of sequential As, Cs, Ts, and Gs. Any serious provider, however, 
would have pre-processed these somewhat for you. Subsequent formats 
are typically called .bam (the As, Cs, Ts, and Gs in their right genomic position) 
and .vcf (the highlight of just the places where your genome is different 
from the genome reference). 

From the .vcf file onwards, the analysis and interpretation methodologies 
tend to merge with those of microarray, although with more opportunities 
in the realm of rare mutations. Therefore, the remaining chapters are equally 
relevant, just with this twist on price and initial pre-processing. 
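
The way a .vcf file ends up looking much like microarray data can be sketched as follows. The example uses the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample) and reads the genotype from the `GT` field of the first sample. The two variants are invented, and real files should normally be read with a dedicated library rather than by hand.

```python
import io

# Hypothetical two-variant VCF for a single sample.
SAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n"
    "1\t1000\trs0000001\tA\tG\t50\tPASS\t.\tGT\t0/1\n"
    "1\t2500\trs0000002\tC\tT\t60\tPASS\t.\tGT\t1/1\n"
)

def read_variants(handle):
    """Return (chromosome, position, id, genotype letters) for the first sample of each line."""
    variants = []
    for line in handle:
        if line.startswith("#"):
            continue  # meta-information and column header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, variant_id, ref, alt = fields[:5]
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        alleles = [ref] + alt.split(",")  # index 0 is the reference allele
        indices = sample["GT"].replace("|", "/").split("/")
        letters = "".join(alleles[int(i)] for i in indices if i != ".")
        variants.append((chrom, int(pos), variant_id, letters))
    return variants

variants = read_variants(io.StringIO(SAMPLE_VCF))
print(variants)
```

Once decoded like this, each line is again a position and a two-letter genotype, which is why the later analysis steps are shared with microarray data.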


3.6. Imputation: How Can it Help 


A concept often used with microarrays is the computational technique 
known as imputation. In the context of statistics, imputation means to 
fill in missing values based on knowledge of non-missing values. Outside 
genetics, this typically has limited value, sometimes amounting to just 
filling in noise in the data. In human genetics, however, researchers can 
take advantage of some pretty neat tricks, based on the fact that we are 
all highly related not too many generations back. The exact mathematics 
is much beyond the scope of this book, but the idea is as follows: We 
know the full DNA sequence, including SNPs not on a microarray, in many 
people, for example those in the 1000 Genomes Project. SNPs that are 
located close to each other on the genome are often passed on together on 
the same DNA strand, from parents to children. This phenomenon is called 
linkage disequilibrium. We can calculate exactly how often it happens. If 
one of a pair of SNPs is measured on a microarray, and another is not, 
then we at least know the probability of each allele at the unmeasured one. 
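
The core idea can be sketched with a toy reference panel. Each entry below is one DNA strand (haplotype) from a fully sequenced person, with the alleles at two neighbouring SNPs coded as 0 or 1. The panel is invented and the function names are my own; real imputation software works with many SNPs at once and far more elaborate statistics. The sketch computes r², a commonly used measure of linkage disequilibrium, and the probability of the allele at the unmeasured SNP given the measured one.

```python
# Hypothetical reference panel: (allele at measured SNP, allele at unmeasured SNP) per strand.
panel = [(1, 1)] * 45 + [(0, 0)] * 45 + [(1, 0)] * 5 + [(0, 1)] * 5

def allele_frequency(haplotypes, index):
    """Frequency of allele 1 at the given SNP (0 = first SNP, 1 = second SNP)."""
    return sum(h[index] for h in haplotypes) / len(haplotypes)

def ld_r_squared(haplotypes):
    """Squared correlation between the alleles at the two SNPs."""
    p_a = allele_frequency(haplotypes, 0)
    p_b = allele_frequency(haplotypes, 1)
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / len(haplotypes)
    d = p_ab - p_a * p_b  # how much more often the alleles travel together than by chance
    return d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))

def prob_second_given_first(haplotypes):
    """Probability of allele 1 at the unmeasured SNP, given allele 1 at the measured SNP."""
    second_alleles = [b for a, b in haplotypes if a == 1]
    return sum(second_alleles) / len(second_alleles)

r_squared = ld_r_squared(panel)
p_guess = prob_second_given_first(panel)
print(f"r2 = {r_squared:.2f}, P(unmeasured = 1 | measured = 1) = {p_guess:.2f}")
```

In this panel, seeing allele 1 at the measured SNP means allele 1 at the unmeasured SNP 90% of the time, so a guess based on the neighbour is right far more often than a coin flip.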

Through a series of complex calculations, it turns out that quite a 
few extra SNPs can then be guessed at a fairly high confidence, even 
though they were not measured from the beginning. Typically, this gives around 
4 million SNPs from a standard 700K SNP microarray; more if you accept 
less confidence. The question of confidence is the most important concept 
to understand. It's basically the probability that the guess is correct. Imputed 
SNPs with a confidence score of 0.98 to 1.0 are pretty much as good 
as it gets. Even direct non-imputed measurements have a risk of error, 
unfortunately. But imputation will also, by default, report SNPs down to 0.9 
confidence, which means that the genotype you think you have is actually 
wrong 10% of the time. Not perfect, but how much that matters depends on 
context. In a multi-SNP signature of hundreds of SNPs, for example, it is 
better to accept this error and include as many SNPs as possible. 
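
The trade-off between confidence and coverage can be made concrete with a small sketch. It treats the confidence score as the probability that an imputed genotype is correct, as described above. The SNP names, genotypes and scores are invented, and the helper names are my own rather than those of any imputation tool.

```python
# Hypothetical imputed genotypes: SNP id -> (genotype, confidence score).
imputed = {
    "snp_a": ("AG", 0.99),
    "snp_b": ("GG", 0.95),
    "snp_c": ("CT", 0.90),
    "snp_d": ("TT", 0.85),
}

def filter_by_confidence(calls, threshold):
    """Keep only the genotypes whose confidence is at or above the threshold."""
    return {snp: call for snp, call in calls.items() if call[1] >= threshold}

def expected_wrong_calls(calls):
    """Expected number of incorrect genotypes, summing (1 - confidence) over all calls."""
    return sum(1.0 - confidence for _, confidence in calls.values())

default_set = filter_by_confidence(imputed, threshold=0.90)
strict_set = filter_by_confidence(imputed, threshold=0.98)
print(len(default_set), "SNPs kept at 0.90,", len(strict_set), "kept at 0.98")
print(f"expected wrong calls at 0.90: {expected_wrong_calls(default_set):.2f}")
```

The stricter threshold keeps only one of the four SNPs, while the default keeps three at the cost of an expected 0.16 wrong genotypes among them. For a signature built from hundreds of SNPs, keeping the larger set is usually the better deal.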

You can download and operate imputation software yourself. The 
most popular packages are called Mach 1.0 and impute2. However, they 
are highly technical and computationally intensive, so it is probably easier 
to use an online tool. Two tools that provide single sample imputation are 
DNA.land and www.impute.me; the latter is a project that I have been 
developing for this purpose. Both sites will take your standard microarray 
genotype file and process it with imputation tools, then return your data 
with millions of extra imputed SNPs. At DNA.land, your data is also kept 
for use by the research group behind it. At www.impute.me your data is 
deleted after two weeks. I highly recommend using these tools in your 
own analysis. Connecting to the latest genetic research is difficult to do 
otherwise, because by far the majority of current scientific studies also use 
imputation of their data. 


3.7. Other Data Generation Methods 


For completeness, it should be noted that there exist yet other methods 
of getting genomic data. The naming schemes of these can be a bit bewildering, 
with legacy names and brand names intermingling. However, if you 
are unsure if you got DNA sequencing or microarray and if you only have 
small file sizes (<0.5 MB), then you probably got an older technology. There 
exist companies out there that will report only on a few SNPs, typically fewer 
than ten SNPs. Be very careful with those. They probably employ legacy 
technologies such as polymerase chain reaction (PCR) based genotyping. 

While these older technologies are cheap and precise, they suffer 
from a lack of coverage and scalability. So, typically, they will only be able to 
measure a few SNPs. With today’s more advanced technologies available, it 
is simply not worth it to buy any of these. The same goes for the many companies 
that offer directed testing of e.g. sex chromosomes or specific genes: It is 
very unlikely to be a better deal than to just get the entire genome using 
one of the above-mentioned complete techniques, microarray or DNA 
sequencing. That's because you can deduce the specific test from the 
general methods, whereas the opposite is not possible. 

Paternity tests are the one exception to this rule. You can easily 
establish paternity from both microarray and DNA sequencing. But, if you 
want that to be legally binding, there are special regulations that dictate 
which test type you use, and these tests are of the older PCR-based type. 


3.8. Interpretation: How can my Genetics 
Company Know this? 


Your genomic data will be highly similar if you choose to have the data 
generation step performed at two different genetics companies (Hong 
et al., 2012). It should be; it is the same DNA. Not so for the interpretation, 
which can vary a lot. This is unfortunate but not surprising. The first thing 
to realise is that there is no secret source, no hidden cache of data that 
just a single company knows about. Every single genetics company derives 
its interpretations from the published scientific literature. I can 
write that so conclusively because of the sample sizes involved in genetics 
discovery. As we saw in Chapter 2, it takes a lot of research and resources 
to pull out knowledge from the genome. And, currently, there are only 
two companies that hold customer databases that surpass the size of the 
biggest public science projects: 23andme and ancestry.com. 

The first, 23andme, publishes their findings to a large extent; for 
example, the hair colour study mentioned earlier is a 23andme study (Eriksson 
et al., 2010). Even if they do have unpublished insights, these don't seem to 
be part of their sales material. Ancestry.com focuses on matching relatives. 
We'll discuss that in Chapter 4. The point, however, is that every company in 
the consumer genetics market is somehow leaning on the scientific literature 
for their interpretation, because that's all they can do. That's not to say they 
don’t play an important role: the published scientific literature is a 
vast and bewildering place, and the finer details of what is of importance 
is the subject of a whole book (this book, Chapters 4, 5 and 6 specifically). 

It follows that you should only trust interpretations if they have a proper 
scientific reference. Any calculation without one is not well-supported. However, 
even when supported by scientific studies, all evidence is not equal. So, 
inevitably, you must consider the problem of different data interpretations 
based on different findings. A few rules of thumb are nice to know: First, 
if something has been found and described in many scientific articles it is 
more likely to be true. The ACTN3 sprint SNPs, for example, have been 
investigated in more than 100 published scientific articles (Ma et al., 2013), 
so the effect is likely to be true. Secondly, larger studies are better. In 
genetics, a large study means thousands of individuals included in the research. 
The complex disease GWAS we talked about in Chapter 2 now typically 
involve up to a hundred thousand individuals. If a genetic effect is found in 
an article testing 50 people, but the effect is not reproduced in an article 
that tested ten thousand, then the effect is probably not a trustworthy 
finding. This is further discussed later, but basically use common sense, 
and if something is particularly important to you, make sure to cross-check 
as much as possible. 


3.9. Interpretation — Taking Matters Into 
Your Own Hands 


Common sense and cross-checking data: this essentially means taking 
control of your own data and starting to explore the internet of science 
yourself. Even if one company does a good job of interpretation, more 
information is available if you use many sources. This was illustrated in the 
advanced case study scenario in the beginning of this chapter. Luckily, most 
consumer genomics providers also offer the download of the underlying 
data, so you are free to do this. This is true for all the major providers: 
23andme, ancestry.com, myheritage and hopefully also new and upcoming 
consumer genetics providers. Sometimes you'll have to click around 
a little. In 23andme, for example, it's hidden behind 5 clicks of navigation 
menus and disclaimers. But overall, having your own data is certain to 
give you a much deeper glimpse into your own genetics. The specific 
approaches depend on the question you are most interested in. The next 
three chapters will further explore each of them: ancestry, rare diseases, 
and common diseases. 


Ancestry and Genealogy 


The most popular use of consumer genetics testing is without doubt ancestry 
testing. People, particularly in the USA, want to know where they come from 
and the genetics companies with the quickest growth are the ones that focus 
on ancestry. It makes sense; DNA is in many ways the best tool for that 
question: where we come from. Amazing studies have been published in which 
researchers were able to determine people’s origin down to a few hundred 
kilometres, basically creating maps of whole continents with 
exact locations of countries, all based solely on genetics input (Novembre 
et al., 2008). But how we use that tool, particularly in connection with other 
approaches, e.g. classical genealogy, is something that can benefit from 
a little more explanation. 


4.1. Not Just Where you Come From — 
but Also When 


Where do we come from? This is the question that all ancestry genetics 
companies claim to solve. They do a great job in many cases. But I can’t 
help but wonder why nobody seems to be asking the next obvious follow-up 
question: When do you mean? Because obviously “where you come from”, for 
you, is written in your birth certificate. If you were born in Copenhagen 
you come from Copenhagen, and if you were born in Shanghai then you 
come from Shanghai, in the sense that it is your place of birth. 

If instead the meaning is where your ancestors were born, well, then 
the answer is Africa. Any human tracing his or her lineage sufficiently far 
back will eventually find some early ancestor living on the great plains of 
eastern Africa, probably around the Rift Valley area (Fig. 4.1). And before 
that our ancestors were even older lineages of the hominid family, drifting 
around in the great game of evolution. 

Fig. 4.1. Rift Valley, Africa. One possible answer to the question of where we came from. 

But obviously neither Africa nor your birth hospital is what people 
generally mean when they ask where they come from. They mean their 
great-grand-parents, or they mean their great-great-great-great-grand- 
parents, or they mean someone in the early medieval times. Someone 
in historical time that they don’t already know, someone they can tell a 
story about. And, for that reason, one main product of ancestry genetics 
is an overview of percentages distributed by country: 45.5% Scandinavian, 
29.0% Broadly Northwestern European etc. (my own data). Actually, 
I think it can be a little boring to read for people who are not from recent 
immigrant families, which may explain the particular popularity of the 
tests in the USA. 

Let's return to the main question: when in history is this “fixed”? 
To understand this, we need to know a little about both how the results are 
calculated and human migration history. 

Fig. 4.2. Migration and invasion patterns at the time of the late Roman Empire (100-500 CE). 


Humans have migrated around the world throughout history. An indi- 
vidual human may usually not have moved much from his or her birthplace, 
but to the beat of generations, literally hordes of people shifted around — 
as is well exemplified by this illustration of major migrations in Europe 
around the time of the late Roman Empire (Fig. 4.2, data from Wikipedia). 

Are these migrations included in your ancestry report? Could you 
be told that you are 42% Visigoth and incidentally a cause of the fall of 
Rome? Or that your ancestors rode to Europe from Asia with the hordes 
of Genghis Khan? 


4.2. Calculating Country Percentages 


We don't get late classical ancestry information in our DNA reports, and the 
reason is the way your percentage is calculated. To calculate it, one needs to 
calibrate with data from many other people, all of whom must have known 
ancestry and known genomes. A typical definition of known ancestry can 
be having all your grandparents come from one country (which is not the USA), 
or some similar definition of “local”. Another definition could be membership 
of a currently existing group or tribe as defined by anthropological studies, 
for example Ashkenazi Jewish. 

The people with known ancestry and known genomes will then make 
up a set of comparison groups. Thus, we may have groups with people that 
are known Finnish, known British, known Chinese, etc. Genomic segment 
by genomic segment your own genome is then compared to the sequence 
for these reference groups and a best guess group is assigned. My 45.5% 
Scandinavian, for example, is given because 45.5% of my genome was of 
sufficiently high similarity to a reference group of people from Scandinavia, 
all of whom had known ancestry and known genomes. My “Broadly North- 
western European” guess was then probably just segments of the DNA 
sequence where people from different Northwestern European countries 
were too similar to perform a more precise guess. 
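
The segment-by-segment comparison can be illustrated with a few lines of Python. This is only a toy sketch and not the method of any particular company: the two reference groups, their expected allele counts, the eight-SNP "genome" and the segment length are all made-up assumptions. Each segment is simply assigned to the reference group it resembles most, and the country percentages are the share of segments assigned to each group.

```python
# Toy sketch of "best guess" ancestry per genomic segment.
# All numbers below are invented for illustration.

def assign_segments(sample, references, segment_length):
    """Assign each segment of `sample` to the closest reference group.

    sample:      list of allele counts (0, 1 or 2) per SNP
    references:  dict mapping group name -> list of expected allele counts
                 per SNP in that group
    Returns a list with one group name per segment.
    """
    assignments = []
    for start in range(0, len(sample), segment_length):
        end = start + segment_length
        best_group, best_distance = None, float("inf")
        for group, expected in references.items():
            # Sum of squared differences over the SNPs in this segment
            distance = sum(
                (s - e) ** 2
                for s, e in zip(sample[start:end], expected[start:end])
            )
            if distance < best_distance:
                best_group, best_distance = group, distance
        assignments.append(best_group)
    return assignments


def ancestry_percentages(assignments):
    """Turn per-segment assignments into percentages of the genome."""
    total = len(assignments)
    return {
        group: 100.0 * assignments.count(group) / total
        for group in sorted(set(assignments))
    }


references = {
    "Scandinavian": [2, 2, 0, 0, 2, 2, 0, 0],
    "Chinese":      [0, 0, 2, 2, 0, 0, 2, 2],
}
sample = [2, 2, 0, 1, 0, 0, 2, 2]  # hypothetical genome of 8 SNPs

segments = assign_segments(sample, references, segment_length=4)
percentages = ancestry_percentages(segments)
print(segments, percentages)
```

A real analysis would also need a fallback such as “Broadly Northwestern European” for segments where several reference groups fit about equally well; the sketch leaves that out.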

From this you can deduce that since there are no modern day defined 
groups of Visigoths and also no more members of the Golden Horde, then 
we unfortunately cannot get such exact answers. We can only reflect our 
ancestry composition on modern day known groups, or at least groups of 
people from which we can both get DNA and ancestry information. 

Maybe these opportunities will be expanded in the future, by the 
works of palaeogeneticists: scientists studying ancient DNA (Rasmussen 
et al., 2010; Green et al., 2006). But for now, your country percentage 
just indicates that your genome looks like that of people currently or very 
recently living in a specific country. 

It also follows that the setup of the “known ancestry” regions is 
really important. Not only are these geographical regions different between 
different companies, but the companies will also have different definition groups and 
give different results.¹ For example, I have noted that more than half of 
the Danish people I know who have taken ancestry tests find that they have ~20% 
British ancestry. There is no good record of a recent or historical 
large influx of British immigrants that can explain this. If anything, the flow has 
historically been in the opposite direction, both now and in the Viking era. However, 
it illustrates a general fallacy of the comparison setup: if a larger number 
of British people take ancestry tests and self-report as British, and a fraction of 
them have admixed genomes from neighbouring countries, such as from Viking 
invaders from Scandinavia, then that will mistakenly be carried over into 
ancestry reporting. 

¹ For an interesting actual comparison of this, I recommend the blog post by Judy G. Russell 
found at this web address http://www.legalgenealogist.com/2017/04/16/still-not-soup/ 

How does this uncertainty compare with the claims that genetics 
can determine your ancestry to within 100 km? Genetics — almost by 
definition — is the best tool for unravelling where your genes come from. 
But it is important to be aware of the limitations imposed by the constant 
dilution and mixing of generations of gene shuffling and migrations, particularly 
when taking the long-term perspective of the millennia. If your 
ancestors have lived on the same small farm in Finland since the time of the 
Vikings, then genetics will nail down your origins to the postal code. If, on 
the other hand, your ancestors roamed the globe, particularly in ways 
that were unique and cannot otherwise be defined as discrete ancestry 
groups, then it is very difficult to say anything at a finer resolution than 
continent level. Therefore, ancestry reports tend to paint with broad 
geographical strokes or defined-group strokes, such as Northern 
Europe or Ashkenazi Jewish. 


4.3. Thousand Type Genomes and 
a Single Mixed One 


To illustrate how this best guess group assignment works in practice, we 
will work through a practical example. In Chapter 2 we introduced the 
Thousand Genomes Project, which is a publicly available data set with 
genomic data from thousands of individuals around the world. Importantly, 
these individuals were all selected because of their clearly identified ethnicities: 
one group was Scandinavian immigrants in Utah, one group was 
Sri Lankan Tamil in the UK, one group was from the Yoruba ethnic group in 
Nigeria, and so on. 

Altogether 29 such groups from around the world were included. 
All of them were selected for the reasons already outlined: they are 
well-defined groups of currently existing individuals. We will add one person 
to this analysis: my daughter, five years old, of mixed Chinese-Danish 
ancestry. 

Fig. 4.3. Ancestry analysis of an individual of Asian and European ancestry. Left: typical reporting 
scheme, percentages from regions and countries. Here given for one individual of half Chinese, half 
Danish descent, my daughter. Right: more detailed clustering analysis of the same individual 
combined with genomic data from all participants of the Thousand Genomes Project. Each 
coloured dot shows an individual from that study, with colour as known ancestry and location 
as genetically established ancestry. The black dot then shows basically the same as the pie 
chart on the left: a genetic location halfway between European and East Asian. But it additionally 
illustrates the components that drive both types of analysis: reference groups with 
known ancestry and known genomes. Note that the plot is actually in 3D, so American and 
African ethnicities are further away than they appear in print. The 3D version can be further 
explored at www.impute.me/ethnicity. 


To calculate a best guess ancestry for her I fed all this data into a 
clustering algorithm, and asked the computer to put similar genomes 
closer together than non-similar ones (Fig. 4.3). Incidentally, this is the same 
algorithm as the one used for the detailed ancestry maps, something called 
principal components analysis (Novembre et al., 2008). In the figure, each 
of the members of the Thousand Genomes Project is shown 
in colour; they had defined ancestry. 

The black dot in the middle is my daughter: exactly halfway between 
the Scandinavian groups and the Chinese ancestry groups. For comparison, 
my own dot would be hovering inside a cloud of other known European 
individuals (Fig. 4.3), generally closer to “Scandinavians from Utah”, with a 
few British and Finnish close as well. You can re-create this analysis with your 
own data using the analysis module found here: www.impute.me/ethnicity. 

So, the computer guessed right. It put me as European, my wife 
as Chinese and our daughter halfway between the two of us. This is not 
even the most fine-grained algorithm available, because I just clumped 
the entire genome together in one analysis. Most ancestry algorithms will 
calculate segment by segment of the genome, and get finer resolution. 
But the principle of defined group vs unknown sample is clearly illustrated, 
and so are the limitations. 
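
The idea behind this clustering can be tried out on simulated data. The sketch below is not the analysis behind Fig. 4.3: the two reference groups, their allele frequencies and the number of SNPs are invented, and only the first principal component is computed, using plain power iteration so that no extra libraries are needed. It shows the same principle on a small scale: a child with one parent from each group lands between the two reference clusters.

```python
import random


def first_principal_component(rows, iterations=100):
    """Return each individual's coordinate on the first principal component.

    rows: list of genotype lists (allele counts 0/1/2), one per individual.
    """
    n_snps = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_snps)]
    centred = [[r[j] - means[j] for j in range(n_snps)] for r in rows]

    axis = [1.0] * n_snps  # starting guess for the principal axis
    for _ in range(iterations):
        # One power-iteration step: project every individual on the axis,
        # then rebuild the axis from those projections and renormalise
        scores = [sum(c * a for c, a in zip(row, axis)) for row in centred]
        axis = [sum(s * row[j] for s, row in zip(scores, centred))
                for j in range(n_snps)]
        norm = sum(a * a for a in axis) ** 0.5
        axis = [a / norm for a in axis]
    return [sum(c * a for c, a in zip(row, axis)) for row in centred]


def draw_genotype(freqs_one, freqs_two):
    """Draw one allele per SNP from each of two parental populations."""
    return [(random.random() < p) + (random.random() < q)
            for p, q in zip(freqs_one, freqs_two)]


random.seed(1)
n_snps = 200
# Hypothetical allele frequencies for two reference groups
freq_a = [random.uniform(0.05, 0.95) for _ in range(n_snps)]
freq_b = [random.uniform(0.05, 0.95) for _ in range(n_snps)]

group_a = [draw_genotype(freq_a, freq_a) for _ in range(20)]
group_b = [draw_genotype(freq_b, freq_b) for _ in range(20)]
child = draw_genotype(freq_a, freq_b)  # one parent from each group

pc1 = first_principal_component(group_a + group_b + [child])
centre_a = sum(pc1[:20]) / 20
centre_b = sum(pc1[20:40]) / 20
child_pc1 = pc1[40]
print(centre_a, centre_b, child_pc1)
```

Real populations differ far less than these simulated ones, which is why real analyses need hundreds of thousands of SNPs and more than one component to separate neighbouring groups.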


4.4. Fine-grained Tools that Won't 
Dilute with Sex 


The limitation in the above fundamentally arises for two reasons: the 
DNA content of your non-sex (autosomal) chromosomes is mixed at 
every generation, and this also happens for everyone else. On the 
Y chromosome, this does not happen. If a man fathers a son, the son's Y 
chromosome will be identical to the father’s. Only a few rare spontaneous 
mutations will accumulate over the generations. For this reason, one can 
use the Y chromosome to trace paternal lineages of descent without the 
autosomal dilution problem. 

If two men have precisely the same sequence of DNA on their Y 
chromosome it can be deduced that they descend from the same father 
at some point in the past. This point in the past is important; since all 
humans came from a relatively small number of African hunter-gatherers, 
this could be true for almost everyone. But rare spontaneous mutations 
do happen; we know of approximately 65,000 genetic variations in the Y 
chromosome and, therefore, we can trace the paternal history and order 
each into families. The time scale of this is much older than the country 
percentages discussed above. Most major Y chromosome splits happened 
tens of thousands of years ago (Poznik et al., 2016). 

Just as the Y chromosome is passed on from father to son, so is the 
mitochondrial DNA passed on from mother to daughter. It is also passed 
on from a mother to her son, but the unmixed passing on is otherwise 
identical in concept. Mitochondria are little bio-power generators that 
each of your cells has, and for some reason they have their own short 
pieces of DNA. These pieces are passed to the next generation only 
through the maternal line. And exactly like the Y chromosome and 
paternal lineages, we can create maternal-only family trees describing 
the history of splits and migrations of this mitochondrial DNA (van Oven 
& Kayser, 2009). 

The insights you can gain from analysis of Y chromosomes and mitochondrial 
DNA mostly concern the locations of your maternal and paternal 
ancestors a very long time ago, back in the ice age. The advantage, 
however, is that the findings are more clear cut: if you find that you have 
a Y chromosome type from the other end of the world, then it's a much 
more solid finding than discovering a few percent of odd country origin in 
autosomal genome analysis. This is because this type of analysis doesn’t 
dilute with generations of mixing.² 

It is also possible that very fine-grained analysis, particularly using 
DNA sequencing, can improve on this in the future, but since such information 
is currently not available from most consumer genetics microarrays, 
I'll leave that as out of scope. 


4.5. Finding Your Relatives. How and Where? 


The third type of genetic ancestry research that is of interest is the pairwise 
match search. In this approach, we simply ask if anyone else has a genome 
that is similar to our own. So, the main input is your own genomic data, 
as well as a large online database of other genomes. When comparing 
two genomes it's trivially easy to identify if they are from twins, parents or 
children. They look much more similar than you'd expect by chance: for 
(monozygotic) twins they are identical and for parent-children comparisons 
they are half identical — i.e. one copy of the DNA is completely identical. 
For fraternal twins and siblings, it becomes slightly more complex: on aver- 
age 50% of all DNA stretches will be completely identical, but in theory 
this number could be anything from all to none. 

This uncertainty arises because of the random selection of which chromosome 
is passed on from each parent, and it explains how the uncertainty 
grows when determining relatedness of more and more distant relatives. 
The expected percentage decreases with the distance of the relation. Uncles, aunts, 
nieces, nephews, grandchildren and grandparents have 25% identical DNA 
stretches on average. Further out, first cousins are expected at 12.5%, 
second cousins at 3.125%, and third cousins at 0.78%. The expected percentages 
are indicated by each family member in Fig. 4.4. 

² For an interesting investigation of precisely such out of the expected mitochondrial DNA 
category I can recommend the blog post by Razib Khan, wherein he investigates the roots 
of his maternal lineage https://gnxp.nofe.me/2017/10/07/the-tibeto-burman-and-austro-asiatic-ancestry-of-bengalis 

The smart thing is that we can measure the actual percentage through 
DNA analysis, and in this way, deduce how closely related two individuals 
are from the expected percentages — even though a relation was not 
previously known. This is the principle of any DNA-based relationship test, 
including paternity tests. 
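
The expected percentages follow a simple halving rule, which can be written out as a short calculation. The function below is a sketch under a standard simplification, not a formula taken from any testing company: every parent-to-child step halves the expected sharing, and the result is doubled when two people descend from the same ancestral couple rather than from one single ancestor. The names `meioses` and `common_ancestors` are my own labels for these two quantities.

```python
def expected_shared_fraction(meioses, common_ancestors):
    """Expected fraction of shared DNA stretches between two relatives.

    meioses:           number of parent-to-child steps on the path that
                       links the two people through a common ancestor
    common_ancestors:  1 for a direct line of descent, 2 when both people
                       descend from the same ancestral couple
    """
    return common_ancestors * 0.5 ** meioses


# (steps, common ancestors) for some of the relations discussed in the text
relations = {
    "parent/child": (1, 1),
    "siblings": (2, 2),
    "grandparent": (2, 1),
    "uncle/niece": (3, 2),
    "first cousins": (4, 2),
    "second cousins": (6, 2),
    "third cousins": (8, 2),
}

for name, (meioses, ancestors) in relations.items():
    fraction = expected_shared_fraction(meioses, ancestors)
    print(f"{name:15s} {100 * fraction:.3f}%")
```

Running it reproduces the averages quoted above, for example 12.5% for first cousins and about 0.78% for third cousins.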

A challenge is that beyond parent-child relations, the expected percentages 
are just averages. Some brothers and sisters may have 55%, some 
may have 45% shared DNA stretches. The actual percentage can only be 
known through measurement, and this sets a limitation on how precisely 
it can be used. For example, I have 61.6% identical DNA stretches with 
my sister (expected 50%) and my daughter has 14.7% with my grandfather 
(expected 12.5%). A value of 14.7% is quite likely to reflect a “level 12.5% 
relation”, e.g. either a great-grandchild, a first cousin or a grand-niece. And 
61.6% is easily interpreted as sibling; what else could it reasonably be? 

So, the test turns out correct in both cases. But the precision can 
become a real problem out around second cousins and beyond, so third 
cousin and fourth cousin identifications should be assumed to have some 
uncertainty in them. Additionally, we do not get any information about being 
from the same generation or not. A grand-niece and a first cousin have the 
same expected level of similarity, 12.5%, even though the grand-niece is 
from a younger generation (Fig. 4.4). This can be quite a challenge when 
trying to assemble family trees. Having DNA samples from older members 
of families really helps the interpretation a lot. 
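
The interpretation step, going from a measured percentage to a list of candidate relations, can also be sketched in code. This is a simplified illustration and not how any specific company does it: the table of levels is typed in from Fig. 4.4 and the text, and comparing on a logarithmic scale is my own assumption, chosen because the expected levels halve from one step to the next.

```python
import math

# Expected percentages of identical DNA stretches and some of the relations
# found at each level (exact halving values; Fig. 4.4 prints them rounded).
LEVELS = {
    100.0: ["identical twin"],
    50.0: ["parent", "child", "sibling"],
    25.0: ["grandparent", "grandchild", "aunt/uncle", "niece/nephew"],
    12.5: ["great-grandparent", "great-grandchild", "first cousin",
           "great-aunt/uncle", "grand-niece/nephew"],
    6.25: ["first cousin once removed"],
    3.125: ["second cousin", "first cousin twice removed"],
}


def closest_level(measured_percent):
    """Return the expected level nearest to the measurement (log scale)."""
    return min(
        LEVELS,
        key=lambda level: abs(math.log2(measured_percent / level)),
    )


for measured in (61.6, 14.7):
    level = closest_level(measured)
    print(f"{measured}% -> level {level}%: {', '.join(LEVELS[level])}")
```

The output for 14.7% lists several relations from different generations, which is exactly the ambiguity described above: the percentage alone cannot tell a first cousin from a grand-niece.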

Finally, notice how we talk about percent of DNA stretches that are 
identical, not about percent of all DNA that is identical. That's because at 
the nucleotide letter level all humans are >99.9% identical anyway. Only the 
variable regions differ (SNPs etc.), as described in Chapter 2. The terminol- 
ogy used here instead refers to the fraction of longer DNA stretches that are 
identical: Stretches of millions of nucleotide letters, including every single 
SNP in the region, must be completely identical to qualify for this metric. 

In unrelated individuals, there usually are no long DNA stretches where 
at least one SNP is not different. I suppose one could also calculate the 
percentages as e.g. siblings being 99.99% identical, cousins 99.95% and 
third cousins 99.91% identical at the nucleotide level. But that would create 
other difficulties in understanding, not least because it would depend on 
the background and ethnicity of every single other person in the family tree. 
For genealogy purposes the 0-100% “shared DNA-stretches” terminology 
is therefore most useful. 

Great grandparents 12.5%
Grandparents 25%            Great aunts, great uncles 12.5%
Parents 50%                 Aunts, uncles 25%                  First cousins once removed 6.1%
Self, identical twin 100%   Sisters, brothers 50%              First cousins 12.5%                Second cousins 3.1%
Children 50%                Nieces, nephews 25%                First cousins once removed 6.1%    Second cousins once removed 1.5%
Grandchildren 25%           Grand nieces, grand nephews 12.5%  First cousins twice removed 3.1%   Second cousins twice removed 0.7%

Fig. 4.4. Overview and definition of family relation terminology: the first, second, third 
cousin and once- twice- removed terminology. By each box, the typical percentage of identical 
DNA stretches is indicated. 

If your goal is to find unknown relatives, I think that the best overall 
advice on choice of genome database is to just test with as many as possible. 
That way you'll have a much broader view of every potential relative, no 
matter their choice of DNA data provider. Ancestry.com or MyHeritage 
are obvious choices, but 23andme, DNAGedcom and DNA.Land also have 
extensive databases with genealogy sharing. Further, if you are 
interested in some of the other metrics, for example what percentage 
of DNA stretches is identical between you and your siblings, then it is 
advisable that you both use the same genetics company, because 
then the calculations are automatically provided. Be aware of the precision 
limitations when contacting people indicated as third or fourth cousins. 
These signals may also have happened by chance. 

The typical next step is to start with regular registry-based geneal- 
ogy and see if you can determine the precise link, finding out who exactly 
your last common ancestor is. In this it really helps to have DNA samples 
from your older living relatives, as it can be used to check if a hypothesis 
is correct. Also, the signal for last common ancestor is obviously stronger 
the closer to that ancestor a person is. 

My own grandfather (morfar) is very skilled at genealogy research 
based on church books. Together we have mapped the family tree back 
to the 18th century along most branches. This is a very useful resource 
to combine with genetic ancestry research, because it allows comparison 
with the family trees of other people, whenever a genetic match is made. 

For our family, this has led to contact with some of the Scandinavian 
emigrants that currently live in Utah. Because of their religion, they have 
become experts in church book based genealogy, and we have found a large 
branch of the Folkersen clan living in Salt Lake City. From that information, 
I have made it a goal to find the genetic link, using all the concepts that 
we have now covered. 

However, because our most recent common ancestors are at least 6-7 generations 
back, it was virtually impossible for me to pick out the genetic links. 
In their most recent Christmas greeting, however, my 6th cousin Brad 
Folkersen wrote to me that I shouldn't worry about this, because our spiritual 
genetics was vastly more important than our mortal genetics anyway. 
Brad is a member of the Church of Latter Day Saints and he is as avid a 
genealogist as I am. 


Rare Disease Genetics: 
One SNP at a Time 


Rare genetic diseases are diseases that are predominantly genetic, i.e. having 
a very high heritability as defined in Chapter 2. They have names such as 
Cystic Fibrosis, Huntington’s disease, Haemophilia, Familial Hypercholes- 
terolemia, Niemann Pick Disease, Phenylketonuria, Tay Sachs Disease and 
many others. A lot more in fact; the list is so long that most people — many 
geneticists included — have never heard about most of them. That's because 
these diseases typically also are very rare. Only a few people suffer from any 
one of them specifically. But defined as a group, they nonetheless make up 
a large burden of disease in humans. As described in Chapter 2, we have 
often discovered the cause of these diseases using family or linkage type 
studies. Typically, they are caused by complete disruption of just a single 
gene and another name for the group is therefore monogenic disorders. 

From the perspective of personal genomics, the most important 
thing to know is that if you are reading this, you likely don’t have one of 
these diseases without already knowing it. That’s because they also have 
very early age of onset — they are typically discovered at birth or during 
childhood. This is a consequence of the severity of completely disrupting 
a gene. It may also be the cause of their rarity; they are often such severe 
handicaps that natural selection must have been exerting its pressure. If 
it is more difficult to grow up and raise a family, it is more unlikely that 
the disease-causing SNP is passed on. However, they exist, and thus the 
primary analysis goal in this chapter is to find out if you are a carrier of the 
disease variant. 

Being a carrier of a genetic disease, as opposed to actively suffering 
from the disease, is often a case of recessive inheritance. Recessive is 
defined as the need for two copies of a disease allele to observe any effect. 
So, for example, having an A/A or A/G genotype may result in no problems, 
but having a G/G genotype will result in an effect. That is an 
example of a recessive effect, and the “effect” in this case is the rare genetic 
disease. It follows that people can live their life with only one G and be 
perfectly healthy. If you are an adult with no known rare genetic disease 
and you are reading this, your primary interest in rare disease genetics 
probably is to find out if you are a carrier for anything and, if so, to find out 
about risks and inheritance patterns. 


5.1. Known and Unknown SNPs for 
Rare Genetic Disease 


Finding out if you are a carrier for anything is a surprisingly complicated 
question in personal genetics analysis. It is complicated because of the 
usage of microarrays, and in this aspect the shortcomings of microarrays are 
very clear. The complication arises because any nucleotide letter change in 
a gene can potentially be the one destroying it. Whether a DNA sequence change 
destroys a gene or not is decided by the mechanics of translating genes 
into protein. Some simple changes in one nucleotide will not necessarily 
give problems. Others, particularly those that insert or delete a nucleotide 
in a gene, are almost always destructive. Note that while a SNP is a type of 
DNA change, more severe types of change also exist, and I therefore write 
“mutations” here. This is to underscore the point that these mutations may 
be common or rare, known or novel, but only the common mutations, the 
SNPs, are measured on a microarray. Therefore, a microarray is blind to a 
large part of the potential DNA problems that can give rise to rare genetic 
disease. It is not blind to everything, so a better analogy is that a microarray 
is like looking at your genome through a fixed sieve. We can detect known 
disease-causing SNPs, and there are many, but we can never exclude 
all the unknown or rare mutations that may also destroy genes. We can 
improve slightly on the investigation space by using imputation analysis 
of microarrays, but even then, there are many potentially gene-disrupting 
mutations that we cannot observe and that could cause rare genetic disease. 

This is a central criticism of the usage of microarrays over DNA 
sequencing. The findings you get are likely true (true “positive” findings), 
but the absence of a finding never lets you conclude that nothing is wrong, 
because of the findings you don't get (false “negative” findings). It is a correct criticism, 
and if your main concern is rare genetic disease, then you should 
consider paying for full DNA sequencing. If not, you can still peek at the 
domain of rare genetic disease, but you cannot make an exhaustive search. 


5.2. Inheritance Patterns — Unborn 
Children and Risk 


Either way, if you have already discovered that you carry a known disease 
allele for a rare genetic disease SNP, then the microarray vs DNA sequenc- 
ing debate is irrelevant. Because then your disease SNP happened to be 
measured on the microarray and it is a “true” positive. In that case, the 
next step is to understand the inheritance patterns. This is the part about 
recessive mutations and about your risks of passing on the full disease to 
your children. 

There are four kinds of inheritance that are relevant here: recessive 
autosomal, recessive sex-linked, dominant autosomal and dominant sex-linked. 
More types exist, e.g. additive, but they are more relevant 
in Chapter 6 when discussing common trait genetics. The four types are 
combinations of “recessive/dominant” and “autosomal/sex-linked”. The 
first distinction is whether one G is enough to give the full effect, so that a 
second G adds nothing more (dominant), or whether you absolutely need 
two Gs to observe any effect (recessive). 
The second, autosomal/sex-linked, refers to which chromosome the 
SNP is found on. Chromosomes 1 through 22 are called “autosomal”, 
a word that basically just means the non-sex chromosomes. The distinction is 
important because the inheritance pattern is very different 
on the sex chromosomes. 

The recessive autosomal is the easiest to explain and the most com- 
mon, so we start with that. Let's keep with the example that G for a given 
SNP will cause disease, if both your copies are of the G-type. If you find 
that you are A/G for that SNP, it means you are a carrier. You will pass on 
the G to your children with 50% chance. If your children get a G from you, 
but an A from your spouse — they too will become carriers, but not sick. 
However, if they get a G both from you and your spouse they will become 
sick. This is the risk we are trying to calculate. So, you have to know the 
genotype of your spouse as well. If your spouse is A/A there is no chance 
that he or she passes on any Gs: there is a 50% chance that your child becomes a 
non-carrier and a 50% chance that your child becomes a carrier like you. 
However, if your spouse turns out to also be a carrier, A/G, then he or she 
also has a 50% risk of passing on a G. If you do the maths it follows that 
your child will have a 25% probability of becoming a non-carrier (A/A), a 50% 
probability of becoming a healthy carrier (A/G), and a 25% probability of 
getting the disease (G/G). If both you and your spouse are carriers and 
the disease is severe, this last 25% makes it important to seek further 
medical counsel. 
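
The arithmetic in this paragraph can be checked by listing all four equally likely combinations of parental alleles. The following is a minimal sketch of that counting, assuming the A/G notation used in this chapter; the function name is my own and it does nothing beyond tallying the combinations.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def offspring_genotypes(parent_one, parent_two):
    """Probability of each child genotype, given two parental genotypes.

    Genotypes are written like "A/G". Each parent passes on one of their
    two alleles with equal probability, giving four equally likely pairs.
    """
    counts = Counter(
        "/".join(sorted(pair))
        for pair in product(parent_one.split("/"), parent_two.split("/"))
    )
    return {genotype: Fraction(n, 4) for genotype, n in counts.items()}


both_carriers = offspring_genotypes("A/G", "A/G")
one_carrier = offspring_genotypes("A/G", "A/A")
print(both_carriers)
print(one_carrier)
```

With two carrier parents the tally gives one quarter A/A, one half A/G and one quarter G/G, and with only one carrier parent it gives one half A/A and one half A/G, matching the percentages above.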

The dominant autosomal case is different, because it only takes one G 
to get the disease, so here you or your spouse would already know you 
had it. The alternative is that a novel mutation has happened. With no G, 
there is nothing to pass on. Therefore, these cases are rarely discovered 
through a personal genetics test, with the exception of breast cancer, 
which is further described later in this chapter. But the mathematics could 
be worked out similarly to the recessive case. As an interesting point, it 
is worth considering that dominant traits are just the reverse of recessive 
traits. If G is the recessive disease allele for a trait, then having at least one 
A allele makes you healthy, so in a sense the A allele for the recessive SNP 
is “dominant” — but for the healthy trait. For rare diseases, we of course 
don't consider “healthy” a trait, so recessive is the word that is used. How- 
ever, it is worth keeping in mind for common binary traits like European 
vs Asian hair colour type, where the recessive/dominant terminology can 
otherwise be confusing. 

The sex-linked inheritance concerns SNPs found on the X chromosome. 
A good example is colour blindness. The reason it is special is that 
men only have one X chromosome. Recessive more precisely means “only 
in effect if no healthy variant exists”. But, since men only have one X chromosome, 
a single G variant is sufficient for disease to happen. 
To make the risk calculations you need to think like this: if you are female, 
and you have one X chromosomal “G” for a disease SNP, then your probability 
of passing it on is 50% as in the autosomal case. But if your child 
is a boy, this X chromosome would be the only one he has: he’s XY, i.e. 
male. If your child is a girl, she’d also get a healthy X chromosome from 
her father (assuming that neither of you has the actual disease). So, the 
probabilities are: 25% baby girl, non-carrier; 25% baby girl, carrier; 25% 
baby boy, non-carrier; 25% baby boy, sick. Taken together, such a setup 
has a 25% probability of resulting in an affected child. Conversely, if you 
are male, you either have the disease already or else you don't have it and 
cannot be a carrier either. 
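
The same kind of counting works for the sex-linked case. The sketch below assumes a recessive X-linked SNP where G is the disease allele, as in the example above; the function name and the outcome labels are my own. The only difference from the autosomal case is that the father passes on either his single X chromosome (giving a girl) or his Y chromosome (giving a boy).

```python
from fractions import Fraction
from itertools import product


def x_linked_outcomes(mother, father_x):
    """Outcomes for a recessive X-linked SNP where G is the disease allele.

    mother:   the mother's two X-chromosome alleles, e.g. "A/G"
    father_x: the single allele on the father's X chromosome, e.g. "A"
    """
    outcomes = {}
    # Four equally likely combinations: one of the mother's two X
    # chromosomes, paired with the father's X or the father's Y
    for from_mother, from_father in product(mother.split("/"), [father_x, "Y"]):
        if from_father == "Y":
            label = "boy, sick" if from_mother == "G" else "boy, non-carrier"
        else:
            n_g = [from_mother, from_father].count("G")
            label = ["girl, non-carrier", "girl, carrier", "girl, sick"][n_g]
        outcomes[label] = outcomes.get(label, Fraction(0)) + Fraction(1, 4)
    return outcomes


carrier_mother = x_linked_outcomes("A/G", "A")
print(carrier_mother)
```

For a carrier mother and an unaffected father, each of the four outcomes listed in the text comes out at one quarter.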

This may be a lot of mathematics. It'll become even more complicated 
in Chapter 6. If you wish to understand it in more depth, I can highly recommend 
further online sources for general genetics knowledge, such as 
www.my46.org. The focus here remains on personal genome 
analysis, and the key point is that for rare genetic disease it is usually only 
really relevant to investigate further if it happens that both you and your 
spouse are carriers of the same disease SNP allele. 


5.3. A Panel of FDA Approved Rare 
Genetic Disease SNPs 


Of relevance to rare genetic disease is the story of the medical reporting of 
23andme. These two things are tied together, because of health authority 
demand for easily interpretable genetic results. Before November 2013, 
customers of 23andme were offered a range of medical interpretations 
similar to what you find elsewhere today in unregulated form, encompass- 
ing both common and rare genetic disease (the “C” case study example 
in Chapter 3). 

But 23andme was a large company and the US Food and Drug 
Administration (FDA) decided to send a warning letter to them, effectively 
prohibiting the company from delivering any medical interpretations. 
After that, the company took the strategy of gradually winning back the 
FDA permission to deliver medical interpretations, one disease at a time. 
A main point of the warning letter was the uncertain nature of the disease 
interpretation, and therefore it made a lot of sense that the first FDA per- 
missions they applied for would be rare genetic disease. That's because 
even though they are rare, the SNPs have high effect size — effectively 
meaning that you can be confident that disease follows the presence of a 
disease-associated SNP. 

Since then the following seven rare genetic diseases have been 
granted FDA permission and are reported to customers, as well as promoted 
in the 23andme marketing material: Alpha-1 antitrypsin deficiency, 
Early-onset primary dystonia, Factor XI deficiency, Gaucher disease type 
1, Glucose-6-phosphate dehydrogenase deficiency, Hereditary hemochromatosis, 
and Hereditary thrombophilia. At first, they were called carrier 
reports, meaning that they gave no information about disease, just whether you 
had any disease allele. Now they are approved to tell about disease status, 
and have been joined by the more common diseases of Parkinson's and 
Alzheimer’s. These latter two are more common, but the SNPs tested still 
have high effect sizes, which is presumably the reason they were next in line 
for approval. We can expect to see more similar approvals. 


5.4. Interpretations Beyond the FDA 
Approved SNPs 


You may want to stop your analysis at FDA approved SNPs or you may 
want to go beyond this. My own opinion is that the FDA had good reasons 
for intervening, but that the reason was mainly a concern that the level of 
understanding was too low in the general population. Therefore, if you 
wish to educate yourself and interpret your data in an objective manner, 
knowing that not all genetic findings are clear cut yes or no cases — then by 
all means you should go beyond. The underlying data generation quality is 
not different between the FDA approved traits and the non-FDA approved 
traits. Only the interpretation will vary. 

If you choose to proceed, your next stop is probably an internet 
search. There you'll quickly find that the number of options and links is 
bewildering. The market for DNA analysis and interpretation is growing 
extremely fast. You'll probably also very quickly find that there's a very big 
difference between the quality of interpretation sites and their visibility online. 

One of the first tools that I would recommend is SNPedia 
(www.snpedia.com). SNPedia is the data repository for the popular Promethease tool. 
SNPedia is a wiki-style resource documenting almost a hundred thousand 
SNPs, complete with links to supporting scientific literature. The 
Promethease tool can intersect your own genetics data with this data in order to 
annotate SNPs. Overall, SNPedia is a great resource for getting a broad 
view of your own genome. As a wiki, this is obviously the polar opposite 
of a stringent FDA-approved SNP test. It is fairly uncontrolled and broad, 
with emphasis on inclusion of data. 

Other tools such as codegen.eu, 24genetics, geneplaza, helix.com, 
impute.me etc. may also be of use. They accept third party data. And, if 
you really want to dive deep into the full overview of tools, the International 
Society of Genetic Genealogy Wiki has a large number of suggestions that 
go beyond genealogy (www.isogg.org/wiki/Autosomal_DNA_tools), many 
of them useful for specialised purposes. 


5.5. Common Sense and How to Cautiously 
Interpret Findings 


For any genetic finding of interest, you should ask a few key questions. This 
is true no matter how or why you find it — the trait or disease may have 
mattered particularly to you, or an analysis service may have highlighted 
it: No matter how, if a SNP result is important to you, you should always 
ask these common sense questions about it first. 

The first question is: is the trait really likely to be driven by only a single 
SNP? Rare traits satisfy this criterion. Traits that recent literature discusses 
as single-gene traits satisfy this criterion. So do some SNPs with very 
high effect size, such as the ApoE-e4 Alzheimer’s association: it increases 
your risk so much that it is justified to interpret it on its own. However, if 
the trait is not likely to be driven by this single SNP, I recommend that you 
skip forward to Chapter 6 and use more SNPs in your analysis of that trait. 

The second question is: how sure can you be that something is a 
real reproducible finding? Many analytics sites, including codegen.eu, 
23andme and SNPedia will give some kind of confidence metric, like a 
number of stars. If a trait or disease is important to you, I recommend 
you double check this a little. As discussed, everything comes from 
published scientific articles and therefore such a check-up inevitably 
leads to primary scientific sources. That can be a bit daunting. The 
rule of thumb is that more studies, better; larger studies, better; newer 
studies, better. In that order. Precisely how many, how large and how 
new depend on each other, so a few large studies or a hundred tiny 
family studies saying that a SNP is associated with a given disease or 
trait are both fine. But as a very rough guide, 
if you can find only one or two studies on something, and the sample 
sizes are in the hundreds, then don’t trust it too much. Peer-reviewed 
scientific articles can exaggerate. 


5.6. Effect Sizes and How a Single SNP Matters 
to You as an Individual 


Once you have established that something seems to be both relevant and 
correct, the third question arises: how much is this going to affect you? To 
understand this, there is no way around going into some fairly mathematical 
considerations. If you wish to skip that, just note that the quantity called 
odds ratio — or OR — approximately corresponds to fold-change. You can 
think of 1.5 OR as having 50% more risk, roughly speaking. 

When you try to understand measured binary traits, however, it is 
very useful to have a deeper understanding of the OR metric. It is only used 
for binary traits, i.e. yes/no things like having a disease or not. Numerical 
traits, like blood pressure or height, are further discussed in Chapter 6, but 
all rare genetic diseases in the scope of this chapter are yes/no concepts. 
And for such things OR is used. 

The formal definition of OR is this: the probability, as odds, of having 
a SNP allele (“G” instead of “A”), in people with disease vs the probability 
in people without disease. For example, if a SNP has 1.3 OR for a G allele 
for stroke, it means that for each G allele it is 1.3 times as likely for a person 
to be in the group of people who have a stroke. Typical values from GWAS 
SNPs range from 1.1 to 1.5 OR. Some particularly strong genetic variants 
have values up to 3.5 OR, such as the much-discussed Alzheimer’s APOE 
SNPs (Kamboh et al., 2012). Non-genetic effects, such as smoking, can 
also be said to have an odds ratio. For example, the odds ratio for lung 
cancer if you smoke cigarettes is approximately 16 OR. 

Doubling and tripling in probability may sound alarming, but it is 
important to see them in context of the disease risk. If we measure the 
genotypes of 15,000 people that we already know have coronary heart 
disease and compare them with 15,000 healthy people, the risk allele for 
a common SNP with 1.3 OR will be distributed as follows: 


OR = 1.3               Have risk allele    Do not have risk allele 

Have disease                8,000                  7,000 
Do not have disease         7,000                  8,000 


These numbers correspond to the SNP with the strongest known 
association with cardiovascular disease (“rs4977574”), and the number 
of people is close to what was found in the first large population studies 
of cardiovascular disease (Wellcome Trust, 2007; Folkersen et al., 2009). 

So clearly the risk allele of this SNP increases your odds of having 
the disease — among the people with the risk allele a thousand more 
did suffer from disease. Among the people without the risk allele a 
thousand fewer had disease. But still, 7,000 with the risk allele did not 
have disease, and another 7,000 without the risk allele nonetheless did 
develop disease. 
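
The odds ratio quoted for this table can be recomputed from its four cells. The following is a minimal Python sketch; the function name `odds_ratio` and its argument names are chosen here for illustration and do not come from any particular genetics package.

```python
def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Odds ratio from a 2x2 table of allele carriers vs disease status."""
    # Odds of carrying the risk allele among people with disease
    odds_in_cases = cases_with / cases_without
    # The same odds among people without disease
    odds_in_controls = controls_with / controls_without
    return odds_in_cases / odds_in_controls

# Counts from the coronary heart disease table above
chd_or = odds_ratio(8000, 7000, 7000, 8000)
print(round(chd_or, 2))  # 1.31
```

The same function applied to any other 2x2 table of counts gives the OR for that table.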

However, we also know that regardless of genetics, approximately 0.1% 
of people between 35 and 74 years old will die from coronary heart disease 
each year (CDC 2011). If the study had instead included a completely ran- 
dom group of ten thousand people 35 to 74 years of age (meaning that we 
did not specifically ask people with disease to participate), the distribution 
would look approximately like this, according to the 0.1% yearly incidence 
rate of disease and according to a 1.3 OR risk SNP: 


OR = 1.3               Have risk allele    Do not have risk allele 

Have disease                    6                      5 
Do not have disease         4,700                  5,300 


The table thus shows a larger proportion of healthy people, reflecting 
a general population. 

Let's translate that into risk for you as an individual. Let’s assume you 
are 35-74 years old and use these numbers to deduce risk. 

Already from your age — before measuring any genetics — you 
know that you have 0.1% risk this year. This risk would also be modified a 
lot by your sex, your cholesterol levels and your precise age. But also by 
your genetics. 

So, you go and buy a genetic test; if you find that you have the risk 
allele for this SNP, then you know that your risk is different: you are now 
in the group where 6 people had disease and 4,700 did not, and your 
probability is 0.12% instead of 0.10%. 

In contrast, if you had found out that you did not have the risk allele, 
then you would be in the group where 5 people had disease and 5,300 
did not and your probability of dying this year would be 0.09%. This is the 
reason why OR = 1.3 is preferred over saying a 30% risk increase. One 
could say that it is a 30% risk increase: 0.12 is roughly 30% more than 0.09. 
But too many people would misunderstand it as a literal 30% lifetime risk. 
And that would clearly be very far from the truth. 
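
The step from table counts to personal risk can be sketched in a few lines of Python. The helper name `risk_percent` is an illustrative choice, and the counts are simply those of the general-population table above.

```python
def risk_percent(diseased, healthy):
    """Share of a group that has the disease, in percent."""
    return 100 * diseased / (diseased + healthy)

# Counts from the general-population table above
risk_with_allele = risk_percent(6, 4700)
risk_without_allele = risk_percent(5, 5300)

# Relative change between the two groups
relative_increase = risk_with_allele / risk_without_allele - 1

print(f"{risk_with_allele:.2f}% vs {risk_without_allele:.2f}%")
print(f"relative increase: {relative_increase:.0%}")
```

Because the table is rounded to whole people, the printed values land close to, but not exactly on, the 0.12% and 0.09% quoted in the text. The point stands either way: a relative increase of about a third still leaves the absolute yearly risk far below one percent.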

This is the basis for many recommendations to basically not worry too 
much about single SNP findings for common disease. At least not before 
you have researched the OR value. Most risk SNPs for common things have 
odds ratios that are in the 1.1-1.5 range. 

The table below shows an example of a much stronger effect such 
as the effect of smoking on lung cancer (OR = 16). Here it makes a big 
difference which category you belong to — smoker or non-smoker. It is 
possible to have disease without smoking but it is a lot less likely. 


OR = 16                    Smoker        Non-smoker 

Have disease                   14                 1 
Do not have disease         7,000             8,000 


When doing personal genome analysis, and looking at one SNP at 
a time, you should therefore only worry if the SNP is indicated to have 
large OR (above 5), or else if it is discussed with words such as “complete 
disease association” or “completely penetrant”, such as is sometimes the 
case in family linkage studies. These words correspond pretty much to an 
OR value of infinity. 

Merely having a SNP that is associated with disease risk is not neces- 
sarily the same as an appreciable change in your own risk as an individual. In 
the next chapter, we will see how this can change when adding up several 
low effect size SNPs, but in this chapter, you now have the main tool to 
interpret effect size of a single SNP at a time. 


5.7. Breast Cancer and the BRCA Mutations 


Breast cancer is unfortunately not a very rare disease. During their life- 
time, 12% of all women will develop breast cancer and 3% will die from it 
(U.S. Preventive Services Task Force, 2015). It is still relevant to discuss in 
this chapter about rare diseases because it also has a strongly hereditary 
component centring on only two genes: the BRCA1 and the BRCA2 gene. 

These two genes are known to contain damaging SNPs that vastly 
increase breast cancer risk. If either of these two genes has damaging 
BRCA SNPs (so-called nonsense mutations), the breast cancer risk 
increases to more than 50%. If we perform the odds ratio calculations as 
described above, then we find that these are very high effect SNPs indeed: 
above 10 OR at least, depending on ethnicity and mutation type. 

BRCA SNPs are of the autosomal dominant type. Recall that this was 
the type where you only needed one copy for disease to happen. But, in 
contrast to the school book case of autosomal dominant inheritance that 
was described earlier, the risks are not completely penetrant. Incompletely 
penetrant basically means the same as the “more than 50% risk” 
stated above: it is not a 100% absolute certainty that the 
mutation will result in breast cancer. Approximately 0.5% of all women 
have damaging BRCA SNPs (Maxwell et al., 2016). 

Finding that you may have a damaging BRCA SNP is a serious matter, 
particularly because breast cancer risk is something where ignorance 
is not necessarily preferable. Through surgical removal of breasts 
(mastectomy), the breast cancer risk is almost completely removed. Difficult 
trade-off considerations therefore face many women when they are 
told that they have damaging BRCA SNPs. To make it even more difficult, 
ovarian cancer is also BRCA related: the lifetime risk of this disease is 1.5%, 
with a lifetime death risk of 1.0%. Mutations in the BRCA1 gene increase 
the ovarian cancer risk to 39% (U.S. Preventive Services Task Force, 2015). 

From the personal genetics point of view, the most important consid- 
eration is if the genetics test that you have bought can tell you about breast 
cancer risk. Unfortunately, the answer is not completely straightforward. 
Recall that personal genetics today is primarily made with the microarray 
technology that only measures known SNPs. While known SNPs include 
some of the damaging BRCA SNPs there are many more potentially dam- 
aging BRCA SNPs that are not detected with a microarray. 

That means that you can never use a microarray test to exclude the 
presence of BRCA mutations. You may get “false negative” findings if you 
do that. That's because a microarray will never detect the hundreds of other 
potentially damaging BRCA SNPs and other damaging mutations that can 
increase breast cancer risk. 

Consumer genetics tests are therefore effectively blind to a large part 
of BRCA genetics, and only clinical BRCA gene sequencing is used for real 
breast cancer risk quantification. 

On the other hand, if a consumer genetics test does tell you that 
you have a known problematic BRCA SNP, then this should be taken very 
seriously. That's because measurements of any single SNP largely have the 
same accuracy and precision, regardless if the measurement is done with 
microarrays or DNA sequencing.¹ That means that “positive” findings are 
highly likely to be true, and it is important to contact your medical doctor 
for further evaluation. 

As a public health consideration, this “highly likely true” becomes very 
important to discuss. That's because population-wide screening for anything 
always gives the problem of unnecessarily alarming people, through false 
positive findings. If we were to test a million women for BRCA mutations 
and breast cancer, we would expect to find numbers like these: 


OR = 12               Have BRCA allele    Do not have BRCA allele 

Breast cancer               3,000                 110,000 
No breast cancer            2,000                 890,000 


¹ Clinical grade DNA sequencing and CLIA certified microarrays both have replication rates 
above 99.9%. CLIA is a regulatory document called the Clinical Laboratory Improvement 
Amendments and it covers amongst other things the microarrays used by 23andme. The 
catch, of course, is that this only means measured SNPs, which is the real distinguishing 
factor between DNA sequencing and microarray. 

Note these are not real study results: no one has yet tested 1 million 
women like this. But the numbers do compare to true study findings (U.S. 
Preventive Services Task Force, 2015; Maxwell et al., 2016), so they serve 
well as an example. If we were to systematically screen a million women, 
we would identify 3,000 women that were truly in a high risk group and 
could be helped from such screening. We would also needlessly alarm 
many women. We would alarm the 2,000 women who would be told that 
they had a BRCA risk gene, but who'd otherwise have had no problems. 
Additionally, approximately a dozen women would even be told that they 
had a BRCA risk gene, when it was in fact just a measuring error. And for 
those two reasons, the current official recommendation is to screen only if 
you have one or more close relatives with breast cancer. 
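
The screening numbers can be checked with a short Python sketch. It uses only the illustrative counts from the table above (in thousands of women); the variable names are chosen here for readability.

```python
# Illustrative counts from the table above, in thousands of women
carriers_with_cancer = 3
carriers_without_cancer = 2
noncarriers_with_cancer = 110
noncarriers_without_cancer = 890

# Odds ratio, computed as the cross-product of the 2x2 table
brca_or = (carriers_with_cancer * noncarriers_without_cancer) / (
    noncarriers_with_cancer * carriers_without_cancer
)

# Share developing breast cancer among carriers and among non-carriers
risk_carriers = carriers_with_cancer / (
    carriers_with_cancer + carriers_without_cancer
)
risk_noncarriers = noncarriers_with_cancer / (
    noncarriers_with_cancer + noncarriers_without_cancer
)

print(f"OR = {brca_or:.1f}")
print(f"risk, carriers: {risk_carriers:.0%}; non-carriers: {risk_noncarriers:.0%}")
```

With these example counts, two out of every five women who are told they carry a BRCA risk allele would never have developed breast cancer, which is the needless alarm discussed above.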


5.8. Future Benefits of Personal Genetics 
in Rare Disease 


In conclusion, BRCA and breast cancer is an area of monogenic disease 
where personal genomics can provide real benefit for the individual. 
It may save a life by highlighting something dangerous but preventable 
or treatable. Conversely, there are good reasons why this is not rolled out 
on a systematic population-wide level. Too many people would needlessly 
be alarmed. Other specific cases exist, such as the early-onset hereditary 
hemochromatosis and hereditary thrombophilia, where potential parents 
with known carrier status are made aware at an early point that genetic 
counselling will be important for them. 

The overarching point that I hope you'll gain from this chapter is 
this: sometimes personal genomics used for monogenic disease may 
provide real medical benefit to an individual. At the same time, however, it 
requires an increased understanding both of the limitations of effect sizes 
(the OR) and of the limitations of microarrays: only common genetic 
variation is measured. 


Genetics of Common 
Traits and Diseases — 
Many SNPs Together 


In this chapter, we will discuss all the common complex traits and diseases 
of humans. The reason geneticists call them “complex” is to contrast them 
with the monogenic rare diseases from the previous chapter. Maybe the 
naming is a bit unfair: the clinical handling of a rare genetic disease like 
Tay Sachs is definitely also a complex matter. But the disease is “simply” 
explained by a mutation in one single gene on chromosome 15. That's it. 
And genetically that makes it vastly simpler than most common diseases. 
So the scope of this chapter is anything common and complex which 
means anything ranging from your hair colour to your risk of myocardial 
infarction. Because genetically speaking, those are in the same concept 
group: Things affected by many genes at the same time. 

Another reason to group these things is that before genome-wide 
association studies got off the ground in 2009 (GWAS, from Chapter 2) we 
didn’t know so much about how any of the common complex traits were 
coded genetically. We just knew that there was a component of heritability 
(see Fig. 2.3). We knew this from the twin studies described in Chapter 2, 
but also from common sense observations, for example how parental height 
tended to influence the adult height of a child. A formula was even 
created that can predict the adult height of a child based only on the 
height of the parents (Luo et al., 1998). It was within 10 cm of correct 95% 
of the time, which is actually not very precise. So, we knew it was heritable, 
but we also knew something more complex was happening, more 
complex than simple averages and more complex than the yes/no genetic 
variations of the previous chapters. It turned out that a large part of the 
complexity was because it was not just a few variations; the genotypes of 
hundreds of different SNPs all had an effect on the adult height of a person. 
Each of the SNPs had only a small effect, but together the effect became 
real. This was discovered with the genome-wide association studies. And 
with this knowledge we could suddenly explain at least some of this 
seemingly random variation. 


6.1. Discrete Units of Inheritance vs 
Continuous Traits 


To understand why this could seem random without genomic insight, let's 
focus on a single SNP with a fairly moderate effect on height. The rs806794 
SNP has an A variant (“taller”) and a G variant (“short”). Each A allele that 
a person has makes him or her 4.5 mm taller than a person with no A, on 
average.¹ Clearly not a very strong effect, but still illustrative when considered 
in the light of hundreds of similar height-affecting SNPs. The genotype of 
this SNP as it is inherited in my own family is shown in Fig. 6.1. Both my 
grandparents (“Oldemor” and “Oldefar”) had the double A genotype, 
so “very” tall (A/A, +9.0 mm). Their children, my parents, have both also 
inherited a “short” G each from their other parents. So they are “medium” 
tall for that SNP (A/G, +4.5 mm). Since the choice of allele that is passed 
on is a 50%-50% chance, it randomly happened that I received an A from 
both my mother and my father, so “very” tall (A/A, +9.0 mm). Again, the 
average height difference for each allele of this SNP was 4.5 mm, so we 
are not talking about visible differences at all. But many small rivers make 
a larger river, and the main point here is that genotypes are not passed on 
as averages, but instead in discrete non-average units. This can 
happen for many SNPs, and when it happens at the same time in the same 
direction it explains why children are not necessarily the average of their 
parents. 
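
The arithmetic behind these numbers can be sketched in Python. The effect size of 0.06 standard deviations per A allele and the 7.5 cm standard deviation are the values given in the footnote; the function name and the choice of two A/G parents are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter
from itertools import product

SD_HEIGHT_CM = 7.5   # per-sex standard deviation of height, from the footnote
BETA_SD = 0.06       # effect per A allele, in standard deviations (Wood et al., 2014)
effect_mm = BETA_SD * SD_HEIGHT_CM * 10   # convert cm to mm

def height_shift_mm(genotype):
    """Average height shift for a genotype such as 'AG', counting A alleles."""
    return genotype.count("A") * effect_mm

# Each parent passes on one of their two alleles with equal chance,
# so the four combinations below are equally likely.
mother, father = "AG", "AG"
children = Counter("".join(sorted(pair)) for pair in product(mother, father))

for genotype, n in sorted(children.items()):
    print(genotype, f"{n}/4", f"+{height_shift_mm(genotype):.1f} mm")
```

Two “medium” A/G parents thus have a one in four chance of a “very” tall A/A child and a one in four chance of a G/G child, rather than always producing the parental average.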


¹ The rs806794 SNP is indicated to have an effect size of 0.06 per A allele (Wood et al., 2014). 
Carefully reading the paper, we find that on page 56 of supplementary materials it is noted 
that the test metric is sex standardized height. One effect size therefore corresponds to 
7.5 cm * 0.06 = 0.45 cm, given that the human height standard deviation (per sex) is 7.5 cm. 
If you wish to approach primary scientific literature such ultra-detailed facts are often well 
hidden. 


Fig. 6.1. Above: The genotype of the 4.5 mm height SNP for each of the family participants 
as well as their summarized multi-gene score in parentheses. The height gene score was 
calculated using 197 SNPs and is reported on an arbitrary scale. Below: Actual heights shown 
by birth year. The overlay shows average height at military recruitment for young men, also 
as a function of birth year. 


Incidentally this observation also concludes an age-old fight between 
genetics father Gregor Mendel and a group of researchers known as 
Biometricians. The fight was about how binary yes/no observations could 
ever explain quantitative traits. And the answer, it has turned out, is simply 
that common complex traits are always also multigenic. So, Mendel was 
right, but in a setting of a lot more yes/no observations than he had ever 
imagined in his early studies of peas. 

Let's try to perform the genetic height calculation with all the 197 
known height SNPs instead of just one (Wood et al., 2014). The multi-SNP 
genetic height score is also shown in Fig. 6.1, in parentheses by each 
individual. We will learn more about multi-SNP scores later, but basically, they 
are a sum of all 197 single-SNP calculations like the one we just performed for the 
4.5 mm SNP. 

In the figure the non-averaging effects are clearly shown, also across 
hundreds of SNPs. From each parent, every single heterozygote (“A/G”) 
SNP had a chance of being passed on either as a higher value (A) or a lower 
value (G). By chance, the trend in my family reflects the trend of the single 
4.5 mm SNP introduced before; or rather, I chose that SNP because it 
mirrored the trend. In the figure, “Oldefar”, my grandfather, has a score 
of 10.0, making him by far the top scoring for genetic height. He is 184 cm 
tall and he was born in 1924. He passed on a 5.7 score to my mother. My 
mother (5.7) and my father (5.2) then passed on an 8.4 score to me and 
a 6.9 score to my sister. 

The first point then is that genetics do not necessarily pass on aver- 
ages, but as sums of many SNPs. The second point is that we can measure 
these SNPs, calculate the derived values and use these values to get insight 
into randomness. 


6.2. The Genetic Height vs the Actual Height 


My grandfather is supposed to be a towering 10.0 gene height guy. But 
what does that actually mean? This “gene-height” value, for starters, is 
arbitrary. That means it doesn’t translate directly into e.g. centimetres. It was 
calculated as we will discuss later in the chapter about genetic risk scores, 
but basically all it means is that someone with 10 is likely to be taller than 
someone with 9, that’s it. How much more likely is a very complex question 
that we will spend the next two sections discussing. 

First, let's just compare the values to actual heights. I plotted the 
actual heights of each of my European family members in Fig. 6.1. They are 
shown by birth year (left-right position) and by sex (red-blue colour). The 
reason for that is to show that there are factors that are at least as important 
for height as genetics is. Men are on average taller than women. Sex is a 
major determinant of height, due to altered hormone production during 
adolescence. So even though my sister has a 6.9 genetic height score, my 
father is still taller than her in spite of a lower genetic height score. 

So, we learn that unsurprisingly a genetic height score is not the sole 
determinant of a trait. Arguably sex is a genetic factor: either you have 
a Y chromosome or you don’t. But since this height score was calculated 
from the 197 SNPs that are not on the sex chromosome the score does 
not reflect sex (they were autosomal). 

Birth year is definitely not genetic — it is a purely environmental factor 
that nonetheless is known to have a large impact on height. This impact 
probably has to do with advances in living standard and food intake, and 
it is clearly documented in the conscription height measurements that are 
overlaid in the chart: people become taller with time (Floud, 1984). That 
means that my grandfather at 184 cm was in fact a very tall man, at least 
for his age. 

It also explains why he, in spite of his large gene height, is 
approximately the same real height as me and my father. And above all, it illustrates 
the concept that an outcome, e.g. height, can be the sum of many factors: 
birth year, sex, and genetic score are just the ones that are obvious to check. 
Taking just these three into account my own grandmother, “Oldemor”, 
should have been taller, but she is not. The reason for that is unknown 
and unmeasured, and it illustrates that genetics, particularly complex trait 
genetics, is not at all deterministic. We will discuss the precision aspect 
later in this chapter. My grandmother herself thinks her height is the right 
one, and I don’t doubt that it is. 


6.3. Gene Score Types and How 
to Calculate Them 


Now we will again introduce some more advanced mathematics, because 
we need to understand more about the genetic height score and similar 
scores for other traits. If you prefer to skip the mathematics, you can safely 
jump ahead a few sections to the discussion on precision. Just note 
that a genetic score is basically the sum of all the alleles (SNP-letters) and 
their effect on a trait. 

Genetic scores, genetic risk scores (GRS), polygenic risk scores, 
multigenic risk scores: the naming can vary a bit, so I will just use gene score. 
But the goal is always the same: to take several SNPs in a person and from 
them calculate a relative score for a trait. When the trait is a yes/no trait 
(e.g. disease), we usually write risk score; if it's a numerical trait (e.g. 
height, blood pressure) we usually do not. The poly- or multigenic prefix is used 
to underscore the difference from rare disease monogenic disorder 
calculations. I will introduce four different ways to calculate gene scores, each 
successively more complex: 


6.3.1. #1 Effect allele count 


A gene score always requires a set of SNPs known to affect a trait, as well 
as knowledge of which allele from each SNP increases the level of the trait. 
We call this allele the effect allele. To calculate a gene score according to 
allele counting, we simply add up how many of these alleles a person has. 
That's it. This can serve the same purpose as the height gene scores from 
the previous section, where we can then compare between people but 
not assume much more about the final result. The È sign, sigma, means 
that we count effect alleles for all known SNPs for a trait: 


$$\text{Count score} = \sum_{\text{snp}} \text{Effect allele count}_{\text{snp}}$$
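
As a minimal sketch of allele counting in Python, assume genotypes and effect alleles are stored in two dictionaries keyed by SNP. The SNP names, genotypes, and effect alleles below are hypothetical and only serve to show the mechanics.

```python
# Hypothetical genotypes for one person: SNP id -> the two alleles carried
genotypes = {"snp1": "AA", "snp2": "AG", "snp3": "CC"}
# Hypothetical effect allele for each SNP (the allele that raises the trait)
effect_alleles = {"snp1": "A", "snp2": "G", "snp3": "T"}

def count_score(genotypes, effect_alleles):
    """Sum over all SNPs of how many effect alleles the person carries (0, 1 or 2)."""
    return sum(genotypes[snp].count(allele) for snp, allele in effect_alleles.items())

print(count_score(genotypes, effect_alleles))  # 3
```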


6.3.2. #2 Weighted allele count 


The drawback of the first method is that different SNPs may have different 
effect sizes. The effect size is given either as something called beta, for 
numerical traits like height, or as odds ratio (OR), for yes/no traits like dis- 
ease risk as described in Chapter 5. A beta just means the amount of trait 
change each allele causes. The name comes from the slope of a line formula. 
A beta of 4.5 mm should be twice as important as a beta of 2.25 mm. 

To calculate the weighted allele count we first count the effect alleles, 
like in #1, and then we multiply each by the effect size. In the case of yes/no 
traits where effect size is given as OR, one should first take the logarithm of 
the OR (because then a sum can be used): 


$$\text{Weighted score} = \sum_{\text{snp}} \text{Beta}_{\text{snp}} \times \text{Effect allele count}_{\text{snp}}$$
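
A weighted score for a yes/no trait could be sketched as follows. The three SNPs and their odds ratios are hypothetical, and reading the exponentiated score as a combined odds ratio rests on the simplifying assumption that the SNPs act independently of each other.

```python
import math

# Hypothetical per-SNP data: effect allele count (0, 1 or 2) and published odds ratio
snps = [
    {"count": 2, "odds_ratio": 1.3},
    {"count": 1, "odds_ratio": 1.1},
    {"count": 0, "odds_ratio": 1.5},
]

def weighted_score(snps):
    """Sum of log(OR) times effect allele count.

    For numerical traits the beta would be used directly instead of log(OR).
    """
    return sum(math.log(s["odds_ratio"]) * s["count"] for s in snps)

score = weighted_score(snps)
# Summing logarithms is the same as multiplying the odds ratios,
# so exp(score) is the combined odds ratio under the independence assumption.
print(round(math.exp(score), 3))  # 1.859
```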


6.3.3. #3 Zero-centred score 


Sometimes it is nice to have a gene score where you know that people with 
positive values have larger levels than average and people with negative 
values have lower levels than average. That'll give you a quicker insight 
into the actual trait level. For example, if someone has a zero-centred score 
that's higher than zero, then you also know that person is most likely above 
the average height — the amount being according to whichever unit the 
beta was. That's slightly more useful than arbitrary values. To calculate: 


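$$\text{Population score}_{\text{snp}} = \text{frequency}_{\text{snp}} \times 2 \times \text{Beta}_{\text{snp}}$$

$$\text{Zero-centred score} = \sum_{\text{snp}} \left( \text{Beta}_{\text{snp}} \times \text{Effect allele count}_{\text{snp}} - \text{Population score}_{\text{snp}} \right)$$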
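
The zero-centring step can be illustrated with a small Python sketch. The two SNPs, their betas (here in mm), and their effect allele frequencies are hypothetical example values.

```python
# Hypothetical per-SNP data: effect allele count, beta (in mm), effect allele frequency
snps = [
    {"count": 2, "beta": 4.5, "frequency": 0.5},
    {"count": 0, "beta": 2.0, "frequency": 0.25},
]

def zero_centred_score(snps):
    """Weighted score minus what an average person in the population would get."""
    total = 0.0
    for s in snps:
        # Average contribution of this SNP: two alleles, each an effect allele
        # with probability equal to the allele frequency
        population_score = s["frequency"] * 2 * s["beta"]
        total += s["beta"] * s["count"] - population_score
    return total

print(zero_centred_score(snps))  # 3.5
```

A positive result means the person sits above the population average for these SNPs, in the same unit as the betas.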


6.3.4. #4 Z-score 


Finally, it is sometimes of use to express the score as something that is very 
standardised, something we can compare across different traits. Luckily 
there exists such a mathematical metric, something called a “standard 
deviation”. A standard deviation is a measure of variation that follows 
some set rules. For example, for scores that follow the usual bell-shaped 
(normal) distribution, about 68% of everyone are within 1 standard 
deviation of the mean score, and about 95% of everyone are within 
2 standard deviations. So, if your gene score for something is 2 standard 
deviations above the average, you know that you are in roughly the top 2.5% 
for that trait. 

Whatever the trait is, height or risk score, that’s often more useful 
knowledge than an arbitrary number. It is calculated simply by dividing 
the Zero-centred score with the standard deviation of the gene score in 
all the population: 

$$\text{Z-score} = \frac{\text{Zero-centred score}}{\text{Standard deviation}_{\text{population}}}$$
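
One way to obtain the population standard deviation is to compute the same zero-centred score for a reference group of people and measure its spread. The sketch below assumes such a reference group exists; the five reference scores and the personal score are hypothetical numbers picked to keep the arithmetic easy to follow.

```python
from statistics import pstdev

# Hypothetical zero-centred scores from a reference population sample
population_scores = [-6.0, -2.0, 0.0, 2.0, 6.0]
# Hypothetical zero-centred score of the person being analysed
my_zero_centred_score = 8.0

sd_population = pstdev(population_scores)  # population standard deviation
z_score = my_zero_centred_score / sd_population
print(round(z_score, 2))  # 2.0
```

A real reference group would of course contain thousands of people rather than five.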




More gene score types will be invented in the coming years, but these 
four cover common intuitive use cases and serve well when discussing the 
use of many SNPs at a time. Already now there exist examples of calcu- 
lations that are based on more advanced machine learning and tens of 
thousands of input SNPs (Lello et al., 2017). These seem to yield even better 
prediction strength, but the mathematics is beyond the scope of this text. 

Ultimately, the central idea that we base a prediction on some combina- 
tion and weighting of effect allele counts will always be part of any prediction, 
and so will the considerations about precision in the following chapters. 


6.4. Typical Confusion in Gene Scores 
and Common SNP Genetics 


There are a lot of details to discuss and understand regarding these scores, 
and that’s even before we ask how much variation they really explain. I think 
the most common problem is to become confused about the directions and alleles of a 
SNP. This is especially so because, unlike for rare SNPs, it is not always obvious 
what is the normal case and what is the non-normal. We have talked about 
risk alleles, effect alleles, minor alleles, and A alleles and it's not always 
clear what the difference is. 

However, these labels in turn define the direction, which is very important
since all SNPs must be calculated the same way in a gene score. To add to
this confusion, different sources sometimes use different strand notations,
which change A to T and C to G, and there are biological exceptions, such as
rare tri-allelic SNPs with A, T and C possibilities. Let's try to demystify a
few of these complications.

First, the strand problem. If you dive into personal genetics you will
stumble upon cases where a SNP is mentioned as e.g. a T/C type, but other
sources, for example another scientific article, have it as an A/G type.
These are not (necessarily) mistakes, but come from the fact that all DNA
strands have a complementary sequence intertwined with them; that’s the
double helix. Unfortunately, scientists have not always agreed on which of
the two strands to read from. That’s a problem because while the complementary
strand is otherwise identical in sequence, it has switched all A to T, and all
C to G, and vice versa.
Back in Fig. 2.1, imagine there’s an extra sequence starting with
TCCTTTG intertwined in a helix with each of the two sets of AGGAAAC
that are already shown for each individual. This doesn’t make things easy,
and we call this issue a strand flip. We do try to standardise this though,
so it's sufficient to just note that this can be a problem in older literature.
Trust newer literature.
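
The letter switching described above can be sketched in Python. The function names are made up for this illustration, and the last helper goes slightly beyond the text: it flags A/T and C/G SNPs, which read the same on both strands, so the letters alone cannot reveal a flip for them.

```python
# Letter-by-letter pairing between the two DNA strands.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_sequence(sequence):
    """Letters found on the opposite strand, position by position."""
    return "".join(COMPLEMENT[base] for base in sequence)

def flip_strand(alleles):
    """Return the same SNP alleles as read from the opposite strand."""
    return tuple(COMPLEMENT[allele] for allele in alleles)

def same_snp_alleles(alleles_1, alleles_2):
    """True if two allele pairs agree directly or after a strand flip."""
    direct = set(alleles_1) == set(alleles_2)
    flipped = set(flip_strand(alleles_1)) == set(alleles_2)
    return direct or flipped

def is_strand_ambiguous(alleles):
    """True for A/T and C/G SNPs, which look identical on both strands."""
    return set(flip_strand(alleles)) == set(alleles)
```

With these helpers, a T/C SNP in one source and an A/G SNP in another are recognised as the same pair of alleles, while a T/C versus A/C mismatch is not.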

Secondly, the allele labels. Alleles are just the nucleotide letters, 
A, C, G, or T. But they come with different labels, depending on the context. 
Often alleles will be labelled as “minor allele” or “major allele”. Minor 
means the allele that is least frequent in a population. Unless the frequency 
is close to 50%, this can be a useful concept, for example with strand flip 
issues, because the minor allele in a population is a fixed thing no matter 
the strand. 

Also, very often the term minor allele frequency is used, which then 
of course gives the frequency of this allele in a population and a general 
measure of how rare or common a SNP is. Note though that in different 
ethnicities this can vary a lot, so the minor allele on one continent can be 
the major on another continent. 
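
As a small sketch of how a minor allele frequency is counted, the Python below tallies alleles over a list of genotypes. The two five-person populations are invented purely to show how the minor allele can differ between groups; the function names are likewise assumptions for this example.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Count alleles over genotypes such as 'AG' (two letters per person)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def minor_allele(genotypes):
    """Return (allele, frequency) for the least frequent allele."""
    frequencies = allele_frequencies(genotypes)
    allele = min(frequencies, key=frequencies.get)
    return allele, frequencies[allele]

# Two made-up populations of five people each, measured at the same SNP.
population_a = ["AA", "AG", "AA", "AG", "AA"]  # G is the rarer allele here
population_b = ["GG", "AG", "GG", "GG", "AG"]  # A is the rarer allele here
```

Note that when both alleles sit close to 50% the label "minor" becomes unstable, which is the caveat mentioned earlier.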

Two other commonly used labels are effect allele and other allele. 
They are similar but not identical to risk allele and non-risk allele. Clearly 
the effect and risk alleles are the ones that do something. But effect allele 
refers to the beta (or OR), and that can be negative because of the way it 
is calculated. For example, the rs212524 height SNP is indicated to have 
effect allele T and a negative height beta of -0.15 mm (Wood et al., 2014). 
This obviously means that the C is the one increasing the height — but we 
still call T the effect allele. It would have summed up to the same result as 
having C as effect allele and beta of +0.15 mm, so it's purely a calculation 
thing. Likewise, for yes/no categories like disease risk where odds ratios
are given, an OR less than 1 would mean that T was the effect allele, but
C was the risk allele.
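
The relabelling between effect allele and risk allele can be sketched as follows. The helper names are hypothetical, and the second half is a small check (under the usual assumption of two alleles per person) that both labellings rank people identically: the contributions differ only by a constant, which drops out once the score is zero-centred.

```python
def to_increasing_allele(effect_allele, other_allele, beta):
    """Re-express a SNP so that the reported allele increases the trait."""
    if beta < 0:
        return other_allele, effect_allele, -beta
    return effect_allele, other_allele, beta

def to_risk_allele(effect_allele, other_allele, odds_ratio):
    """Re-express a SNP so that the reported allele has an odds ratio above 1."""
    if odds_ratio < 1:
        return other_allele, effect_allele, 1 / odds_ratio
    return effect_allele, other_allele, odds_ratio

# The height SNP from the text: effect allele T, beta of -0.15 mm.
height_snp = to_increasing_allele("T", "C", -0.15)

# Contribution of this SNP for people carrying 0, 1 or 2 copies of T,
# computed once with T as effect allele and once with C as effect allele.
t_counts = [0, 1, 2]
scored_with_t = [n * -0.15 for n in t_counts]
scored_with_c = [(2 - n) * 0.15 for n in t_counts]
offsets = [round(c - t, 10) for t, c in zip(scored_with_t, scored_with_c)]
```

Every entry of `offsets` is the same number, so the choice of effect allele shifts everyone's score equally and leaves the comparison between individuals untouched.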

Lastly, another thing to be aware of is tri-allelic SNPs. They do exist:
SNPs are known with A, T, and C alleles all detected (still only
two per individual), but the third allele is almost always so rare that it can
be ignored for practical personal genomics. Additionally, the rarest of the
three alleles usually will not be measurable with a microarray.


6.5. Utility of Different Gene Scores


It is worth noting that in some cases score #2, the weighted allele
count score, is directly translatable to real-world units, e.g. if the authors
of a genetics study have taken care to be clear on what they use for beta
values. For example, a study by Cornelis and co-workers clearly states
that one beta unit corresponds to one cup of coffee per day (Cornelis
et al., 2015).

With that you can calculate the #2 weighted allele count and then
the output should correspond to a direct metric of how many coffee cups
per day you consume compared with the baseline. The problem, however,
is that this baseline is often fairly poorly defined. Maybe it's the average
coffee consumption (or height, or disease risk), but often we don’t know.
That’s the motivation for the Z-score: if you know that the mean amount of
coffee consumption is 1.84 ± 0.85 cups (and such information is easier
to find online), then you can quickly calculate the genetic best guess from
that. My Z-score for coffee consumption is 0.42, meaning people with my
genetic profile on average drink 1.84 + 0.42 × 0.85 cups per day: 2.2 cups
per day. That sounds about right.
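
The coffee arithmetic above, written as a tiny Python sketch. The function name is an assumption for this illustration; the numbers are the ones quoted in the text.

```python
def expected_trait_value(population_mean, population_sd, z_score):
    """Genetic best guess: the mean shifted by z_score standard deviations."""
    return population_mean + z_score * population_sd

# Mean of 1.84 cups, standard deviation of 0.85 cups, Z-score of 0.42.
cups_per_day = expected_trait_value(1.84, 0.85, 0.42)
rounded_cups = round(cups_per_day, 1)
```

The same function works for any trait where the population mean and standard deviation are known in real-world units.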

Another useful aspect of Z-scores is that they are easy to convert into
percentages of the population. For example, a Z-score of 0.42 is equal to
knowing that people with my genetic profile typically consume more coffee than
66% of the general population. Expressing your risk as a percentage of
the general population is also easier to interpret. Even Excel has a function
for that (norm.dist).
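
For readers who prefer code to spreadsheets, the same conversion can be sketched with Python's standard library, assuming the gene score follows a bell-shaped (normal) distribution. The function name is made up for this example.

```python
from statistics import NormalDist

def z_to_percentile(z_score):
    """Percentage of a normal population with a lower score than z_score."""
    return 100 * NormalDist().cdf(z_score)

# The coffee Z-score from the text.
coffee_percentile = z_to_percentile(0.42)
```

In Excel, the equivalent call would be NORM.DIST(0.42, 0, 1, TRUE), which returns the same cumulative fraction.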


6.6. Precision of a Gene Score — Not a Single 
Value, but a Better Guess 


If you chose to skip the more mathematical sections regarding gene scores
you can start reading again here. The take-home message of the previous
three sections is this: the genetic component of common complex traits
is dictated by combinations of many different SNPs, and we can add these
together to arrive at a single value, a gene score.

One last, but very important question about gene score interpretation
is still left open: if a person with my genetic profile typically consumes more
coffee than 66% of the general population, how sure can I be of how
much coffee I actually consume (apart from counting it)? This is not an
important question for coffee, but if it had been your blood pressure or
your disease risk, it would have been. The quick answer is that it’s always
better to measure it directly. Measure the blood pressure. Measure the height.

The genetics are just a reflection of the known heritable component, 
as introduced in Fig. 2.3, Chapter 2. Because environment and genetics in 
combination produce a trait, the variability explained by genetics is always 
less than 100%. Of course, genetics is also a reflection of the future, and 
this is often why we are interested in it. 

To understand how much variability is explained we will again use
the height example: the parent-height-based calculation and the height
GWAS (Wood et al., 2014). Consider first a baby girl, born in 2012. We do
not know her adult height, but we know her sex and the country and time
at which she was born. Already from that we can make an educated guess
about what her likely adult height will be. We can do so because we know
that the average height for women is 169 cm and the standard deviation
is 7.4 cm, written as 169 ± 7.4 cm. According to the rules about standard
deviation already introduced, we therefore know with 95% probability
that she will be between 154 and 184 cm tall. Not a very impressive guess
precision frankly, but not completely blind either.

Let's then try to add knowledge about her parents' heights, with the
parent height formula (from the caption of Fig. 6.2). This improves the guess
to 170 ± 5.0 cm, so slightly taller (because the parents are both 1 cm taller
than average) and also a slightly more precise guess; the bell curve is
narrower. Now we know that she should be between 160 and 180 cm, with a
95% probability. This difference is illustrated as the light grey and medium
grey bell curves in Fig. 6.2.
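
Both 95% ranges follow from the "mean plus or minus 2 standard deviations" rule, which can be sketched in Python as below. The function and variable names are assumptions for this illustration; the means and standard deviations are the ones given in the text.

```python
def interval_95(mean_cm, sd_cm):
    """Approximate 95% range: the mean plus or minus 2 standard deviations."""
    return mean_cm - 2 * sd_cm, mean_cm + 2 * sd_cm

# Knowing only sex, country and year of birth.
all_women = interval_95(169, 7.4)

# After adding the parents' heights.
with_parents = interval_95(170, 5.0)

# Width of each range in cm, to compare the precision of the two guesses.
width_all = all_women[1] - all_women[0]
width_parents = with_parents[1] - with_parents[0]
```

The narrower second range is the numerical counterpart of the narrower bell curve.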

Then we add genetics — from twin studies we know that height is 
80% heritable, and from the height GWAS we know the SNPs that explain 
21-40% of this variance (Marouli et al., 2017; Wood et al., 2014; Lello et al., 
2017), see Fig. 2.3. So, when we measure and calculate a gene score, we 
can take that amount of variance out of our final height estimate, resulting 
in a more precise value. This is illustrated as the black bell curve in Fig. 6.2. 
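
How removing explained variance narrows a bell curve can be sketched with the standard relationship that the remaining standard deviation equals the original one times the square root of the unexplained fraction. This is a generic statistical sketch, not the calculation behind Fig. 6.2: exactly how the quoted percentages enter that figure is not spelled out here, so the fractions passed in below are assumptions for illustration only.

```python
from math import sqrt

def remaining_sd(sd_before, variance_explained):
    """Spread left after removing a given fraction of the variance."""
    if not 0 <= variance_explained < 1:
        raise ValueError("variance_explained must be in the range [0, 1)")
    return sd_before * sqrt(1 - variance_explained)

# Illustrative only: the 7.4 cm spread for all women, narrowed under two
# assumed fractions of total variance explained by a gene score.
for fraction in (0.21, 0.40):
    narrowed = remaining_sd(7.4, fraction)
    print(f"{fraction:.0%} explained -> remaining SD of {narrowed:.1f} cm")
```

Because of the square root, even a sizeable fraction of explained variance shrinks the spread only moderately, which is why the prediction stays a range rather than a single value.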

The overall point of this is that common trait genetics can take a
prediction and make it more precise. It cannot make it into a crystal-clear,
single-value correct guess. The numbers in Fig. 6.2 correspond to the gene
scores for my daughter's height as calculated in Fig. 6.1. That score
suggests she will have an adult height of around 174 cm. Here, we see that it
is not a single final value, but just an improvement in guess precision, as
illustrated by the Fig. 6.2 bell curves.

Fig. 6.2. Height distribution of different groups of women. Light grey: all women. Medium
grey: all women with parents who are 186 and 163 cm tall. Black: all women with a gene
score of 7.3. The formula for estimating a woman's height from that of her parents is
h = 37.85 + 0.75 × (h_dad + h_mom) / 2 (Luo et al., 1998).

When I presented these findings to my family around her third birthday,
we ended up writing them down as a formal bet on her future height. I did
bet exactly 174 cm. But to underscore the imprecision I took 1:5 odds on
the bet. If she turns out to be 174 cm I will be the happy receiver of 35
bottles of red wine. Otherwise I lose 7 bottles, and there unfortunately is a
fair chance of that. We will know in 2030.


Genetics of Common Traits and Diseases — Many SNPs Together 67 


6.7. Genetic Findings you can do Something 
About — and Those you Cannot 


Height is a good concrete example of a common complex trait. Every- 
one has a height, it is readily measurable and we know it has a genetic 
component. Everything we have learned about height genetics can be 
readily translated to any other numerical trait, including medical obser- 
vations: blood pressure, glucose level, calcium levels, age of Alzheimer’s 
onset, etc. 

With a few mathematical extra steps — that is the logarithm step in 
gene score type #2 — it can also be translated into yes/no events, including 
the risk of virtually any disease. In this case, we instead quantify the risk 
of disease as a parameter. In all cases the calculations and theory are the 
same, including the considerations of imprecision. It is just the list of SNPs 
and their effect sizes that are different for each trait. 
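
A minimal sketch of that logarithm step in Python, assuming it means weighting each effect-allele count by the natural logarithm of the SNP's odds ratio, as is common practice for risk scores. The function name and the three example SNPs are invented for illustration.

```python
from math import log

def weighted_log_odds_score(snps):
    """Sum of effect-allele counts weighted by the log of each odds ratio.

    `snps` is a list of (effect_allele_count, odds_ratio) pairs.
    """
    return sum(count * log(odds_ratio) for count, odds_ratio in snps)

# Three made-up SNPs: (copies of the effect allele, odds ratio).
example_snps = [(2, 1.10), (1, 0.90), (0, 1.25)]
score = weighted_log_odds_score(example_snps)
```

Taking the logarithm makes an odds ratio of 1 contribute nothing, and lets protective alleles (odds ratio below 1) subtract from the score in the same way that a negative beta does for a numerical trait.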

Since almost every common complex disease has been investigated 
using the GWAS approach, it means that you can calculate your gene 
score for every common disease. The calculation is described in the above 
sections, and it basically consists of identifying studies of interest and cal- 
culating the score as described. As previously discussed, many websites 
offer automated analysis of your data, but you should be aware that often
the wrongful assumption is made that common complex disease is
determined by just one SNP. In this section I hope I have convinced you that it
is not, and I think this is something where we will see a change in thinking
over the coming years.

You will be faced with the possibility of getting high risk scores for
diseases that you cannot do anything about. I advise that you think thoroughly
about this possibility before you start searching. Some people would prefer
not to know; others want to know no matter what.

Whichever is your choice, I also advise that you pay close attention to
the statistical nature of the gene score. It can modify your most likely risk
profile and it can make the risk estimate more precise. But it is not 100%
exact. No common complex disease has, by definition, that much
influence from genetics.


6.8. Precision Medicine: Diagnosis vs
Treatment Choice 


The overarching problem with using genetics to predict your risk of a 
common complex disease is that genetics is not an accurate crystal ball 
for these disease types. It may improve diagnosis somewhat, but because 
of the interplay between environment and genetics, it is theoretically 
impossible to ever precisely say at what age you'll get e.g. a stroke. We 
may see that genetics will start playing a role in diagnosis in combination 
with existing medical biomarkers. 

But if so, that would be outside the scope of personal genomics as 
it is today — because then you'd also need to have these other medical 
biomarkers measured on a case by case basis, and it would be more like a 
regular medical check-up. It may happen soon, but it is not the data that 
you have right now from your consumer test. 

I believe that an area other than diagnosis will be of much more interest
in the complex genetics of the future: the area of traits that are directly
related to otherwise uninformed choices. With this I think mainly of the
choice of medicine. Drug response traits is the formal name for this.

Some people are known to react well to some drugs while other
people will not have much benefit from the same drug. This is a well-known
thing in medicine, and it is the cause of the lengthy process of re-prescribing
new types of medicine until the patient and the doctor find something that
works, also known as the “diagnostic odyssey”. In a medical setting one
typically starts with the most well-established drug for a given diagnosis;
usually this is also the cheapest since it has gone off patent. If it works, both
patient and doctor are happy, but if it doesn’t, the odyssey continues to
the next drug in the so-called treatment cascade.

It is true for almost all common diseases that more than one drug
exists to treat them, and consequently this choice almost always exists.
And it is true for all drugs that there is individual variability in drug
response levels.

Like all traits, this drug response variability can be measured and
investigated. We therefore know thousands of SNPs that can be used
to predict how well a drug works, e.g. (Cui et al., 2013; Folkersen et al.,
2016; Natarajan et al., 2017; Johnson et al., 2013). Unfortunately, most of
them correspond to the weak drug response predictor scenario in Fig. 6.3.
They can predict drug response, just not very well. An exception to this
is cancer, which is at the forefront of this field. Here, mutations arising in
specific cells during your lifetime (“non-germline mutations”) have been
proven to have very large potential for prediction.

[Fig. 6.3 panel labels. Vertical axis: How well a drug works. Scenarios: no precision medicine (prediction: any patient); weak drug-response predictor (prediction: “Poor” or “Good”); strong drug-response predictor (prediction: “Poor” or “Good”).]

Fig. 6.3. Illustration of the precision medicine concept in three different scenarios. The no
precision medicine is the current case for most diseases and drugs; for a given disease, a
standard first line drug is given. If it turns out not to work, the next drug type is tried. There
is a lot of variation in how well the drug works. The weak drug response predictor is the
current state of genetics research for a broad range of diseases, but it is not used clinically. In
this case, we know SNPs that predict drug response, but their predictive effect is weak and
only when averaging over many patients can we see a difference. The strong drug response
predictor is the ideal; here a prediction of good response is highly likely to mean that the
drug will work. It is currently in clinical use for some cancers, such as HER2-positive breast
cancer (a non-germline mutation), where 1 in 5 patients will have a “Good” response profile
and consequently be treated with HER2-specific drugs such as Trastuzumab/Herceptin.

Unfortunately, such non-germline mutations are not detectable from 
the spit samples used in consumer genetics. But they exemplify the strong 
drug response predictor as shown in Fig. 6.3, which is an ideal scenario in 
precision medicine: We make a test, and give a decisive answer on which 
drug will work. 

In a sense these two scenarios correspond to the rare and common 
disease chapters in this book. One is common variation but weak effects. 
Another is rarer but stronger effects. It is an open question if we will ever 
find strong common genetic effects for all drug responses. Unlike the rest 
of the common complex disease chapter, however, these weaker findings 
may have clinical use already today. Since the choice of drug within a 
treatment cascade is made on a trial and error basis, even weak genetic
predictors may provide benefit to patients who, on average, will be given 
more appropriate treatment at an earlier time point. 

Knowledge like this, even with all the caveats of being just a
statistical guess, is much more useful in a context where choices are anyway
being made largely blindly. I believe this area is something we will hear
more about in the coming years, and it is highly relevant in the context of
personal genomics, because many of the precision medicine SNPs are
already readily available.


6.9. Future Benefits of Personal Genetics 
in Common Complex Disease 


In conclusion, common complex disease is also an area where personal
genomics may provide a benefit. However, I believe the real benefit is
more likely to be indirect: lowering the accessibility threshold for
using precision medicine findings. If it is more burdensome for the patient
and the doctor to order a weak drug response predictor test and wait for it
than it is to just try the first drug in a treatment cascade, then we will miss
out on opportunities for more personalised treatment.

However, if genetic data is already available at the click of a touch- 
screen, then there is much opportunity in implementing a more fine-tuned 
choice of treatment, which may give a treatment benefit even in the weak 
response prediction scenario. 


The Future 


We have seen how virtually all human traits are affected by genetics. The 
effect can be small, as is often the case for common traits and diseases. 
If the genetic effects are small, they also come with a large amount of 
imprecision in how well they predict the trait (e.g. a disease). Large genetic 
effects on the other hand are typically more deterministic but only found 
in rare genetic diseases. 

Genetics thus exists on a spectrum from the strongly deterministic, which
is rare, to the indirect modulation of traits, which in contrast is part of the
lives of most people. But everything about our life and health is affected by
DNA. That is why genetics and DNA draw such attention in public debate.


7.1. The Positive Aspects of Genetics 


In the previous chapters, I have attempted to always highlight insights that
could be beneficial for you, individually, if you want to further investigate 
your own genome. The best examples of beneficial use are, of course, those 
that are already implemented in modern health care systems: Screening of 
newborn babies for genetic defects that are actionable (such as enzyme 
deficiencies), screening of heritable cases of breast cancer, or determination 
of optimal treatment choices in cancer (e.g. Fig. 6.3). These are all cases 
where we can learn something and act accordingly. 

In the near future, we can expect the areas of application to expand.
The price of DNA sequencing is dropping, and we continuously expand
our knowledge of how SNPs affect diseases, particularly combinations
of many SNPs together. These two things will lower the threshold for when a
genetic investigation pays off. Talking about when a genetic investigation
“pays off” is of course not meant to imply that a medical doctor would fail
to perform an essential test for a genetic disease, not if they suspect that
there is one.
However, many medical tests are carried out with the sole purpose of
additional diagnosis support or exclusion of less likely diagnoses (differential
diagnosis). This is a good thing that will increase the quality of
health care. When any test becomes cheaper and more routinely implementable,
the bar will be lowered for it to serve as an additional tool in the tool
box of the medical doctor. This is also true for genetic tests.

Early and precise diagnosis is a goal in itself. But we must not forget 
that it is only the strong genetic effects that can be used for stand alone 
diagnosis. These were the ones from Chapter 5, which typically were rare 
diseases. That a genetic effect must be strong is particularly true when 
population-wide screening is discussed. That's what we saw in the breast 
cancer example. It’s a bad idea to tell healthy people that they may be sick,
but that’s the result when screening for something with a weak effect,
regardless of whether it is a SNP or an early-stage cancer lump.

This is a strong motivation for precision medicine, as introduced in
Chapter 6. With that, patients are not created from healthy people. Instead
we help patients in selecting the treatment that has the best chance of
succeeding. When this is done, the genetic effects are often rather weak,
unfortunately. That's at least what we have seen in major common diseases
so far. As long as such incremental gains are the case, I think an important
challenge will be to lower the threshold for how complicated it is for a
patient and a medical doctor to retrieve such information.

One way to lower complication thresholds is to have DNA data ready 
and measured already before it is needed. Already today it is possible 
to improve treatment of many common diseases by using this genetic 
information. But if the cost is the extra hassle and waiting time for further 
measurements, then the improvement is simply not worth it. 

These improvements in treatment choice can be obtained through
consumer genetic measurements, at least to some degree. Throughout
the book, we have seen examples of limitations: it's not possible to exclude
BRCA mutations, it’s not possible to detect many of the mutations where
entire genes are destroyed, and it’s definitely possible to misinterpret a
lot of the possible findings.

Nonetheless, consumer genetics has one advantage over clinical
genetics: it is available to you, right here, with no delay or
intermediaries. And if the goal is to lower thresholds of accessibility, e.g. in
precision medicine, then this is actually a considerable advantage. We can
only act on the information that we currently have.


7.2. The Scary Aspects of Genetics 


Do you want to know exactly when and where you are going to die? Do you
want to know that you'll end your days with severe early-onset dementia?
Catchy headlines for sure, but I hope that such headline grabbers have
been put somewhat in perspective in this book. What we know about
heritability tells us that precise predictions are only possible in the rare cases.

If you have a lucky DNA profile for cardiovascular diseases, then it is 
nonetheless a good idea to eat healthily and get plenty of exercise. That 
is also the case if you have a less lucky DNA profile. In both cases, your 
total health level will be increased by a healthier lifestyle. Genetics and 
environment do interact that way. And environment is what you can control. 

Regardless, the problem of scary news should not be underestimated.
One of the main arguments against consumer genetics is that it is harmful
for people to be anxious, no matter whether the anxiety arises from a correctly
understood or a misunderstood genetics report. If a person convinces
themselves that nothing matters because they are going to die, then
their quality of life is measurably reduced.

There are many ways to convince yourself that you'll die. You will
inevitably be right. However, such a sad attitude serves no one. It's better
to live while you do. Exactly where to strike the balance between
pessimism and optimism is not for me to decide. However, I do not agree with
the specific argument that we should ban things that make people sad,
as is currently being discussed in many European countries. If this were
the case, we could just as well ban the internet and books on medicine.

I believe that information and education is the way forward. It was
actually because of exactly this type of debate, in a Danish radio programme,
that I decided to write more about the subject (Koch, 2018). That, and then
because I was curious about my daughter's hair colour...

Where the debate ends, and how many European countries out- 
law consumer genetics is for the future to show. But in contrast to these 
unfounded fears of death and disease, it is worth discussing a range of 
other dangers that are real and tangible in a genetic future. 


7.3. The Truly Problematic Aspects of Genetics 


Imagine that a health insurer systematically excludes customers with high
genetic risk scores for diseases that are particularly expensive to treat. Even
a slight change of customer composition would matter a lot to the finances
of such an insurance company, at the expense of the people with unlucky
genetic profiles. This scenario has been debated before, and such exclusions
were already made illegal in the US by the Genetic Information
Nondiscrimination Act of 2008. The problem, however, is that such
discrimination is very hard to prove if the insurance company can covertly
check your DNA results. So this is a really scary aspect.

More speculatively, it is often debated whether employers could use
DNA-based intelligence scores as part of employment testing. This is possible,
as one can calculate a genetic IQ score. However, it would have exactly
the same limitations that we saw for height and other common traits and
diseases. I therefore believe that an ordinary intelligence test is more useful,
at least if a company thinks that knowing the IQ level of their employees
is important. There is information in your DNA that most people would
agree should be kept private, regardless of whether it is because you wish to
buy health insurance, apply for jobs or something else. Along these lines,
plenty of real worst-case scenarios are possible. The 1997 movie “Gattaca” has
plenty more.

Further regarding consumer genetics, there is one more thing you
should know. The data that is measured is used for more than just
providing you with genetic reports. It is used by the genetic testing companies
to improve genealogy research and research in medical genetics. Data is
gold, and that is also the case for DNA data. A company such as 23andMe
has collaborations with many major pharmaceutical companies (e.g. Hinds
et al., 2016). This will likely result in better drug development and better
medical research. Hopefully, the genetic testing companies will have
adequate security and not misuse their position.

In my own country, Denmark, we sometimes go even further. Recently
an initiative called the National Genome Centre was launched, with the
ambition of integrating all DNA testing of every citizen. Whether this
specifically means that every person will have their DNA measured at birth or
whether we will arrive at a more ad hoc solution is still up for debate. How we
will use the data to gain medical benefits is a discussion that is still very
open-ended.

Currently the consensus seems to be to prioritise personal choice 
and ask everyone for consent at the time of need. This is probably a good 
decision. Personal choices are important. 


7.4. The Choice 


The right to make your own choice should be at the very centre of this.
I don’t understand the medical doctors who wish to outlaw private genetic
testing. That is a limitation for those people who choose curiosity. Likewise,
I also find it difficult to agree with forced inclusion in national DNA
databases. That is a limitation for those people who prioritise privacy.

For that reason, I think and hope we will see more genetic solutions
that are even more personal than those we have today, with a focus on data
ownership. This is already part of the trend for many other data-heavy
aspects of our everyday life, for example communication, and it is also
part of the EU's new personal data laws (EU, 2017).

As far as I know there are no consumer genetic companies that have
this emphasis on privacy and security. I think and hope this is something
we will see in the near future. But, most of all, I hope that we will have a
genetics debate that is more focused on the choice of the individual.

My own choice is that I am very open with my DNA data. The worst
I have ever heard from that is that my neuroticism score is incredibly high:
above the level of 99% of the general population. Then people laugh and
say that they now do believe genetics. But I laugh with them and remember
the limitations of such predictions. That is my choice.

Exactly how this balance between choices, benefits and privacy
dangers will play out in the future is a question just as interesting as it
is for the rest of the big data world that now surrounds us. By writing this
book I hope I have contributed a more realistic perspective, both on
the possibilities and the limitations. Some voices in the debate claim that
consumer genetics is worthless and dangerous (Christiansen & Gerdes,
2017). Others try to sell you products that they claim can predict everything
about you, including your superhero powers (Orig3n, 2017).

My claim is that there are concrete benefits in genetics, including 
consumer genetics, but that it is key that we improve our understanding 
of how to interpret them. 

I hope that this book can provide some of the tools that are necessary
for you to have this understanding.


Appendix A


Operating Instructions for 
Impute.me 


A.1. Motivation 


You have now understood the underlying concepts of personal genetic analysis. 
You've been pointed in the direction of useful websites and genetic testing com- 
panies. You want to use this information and make your own hands-on analysis. Is 
this book not supposed to be the hands-on guide that does exactly that? 

It is, but inevitably the state-of-the-art analysis is also going to be continuously
improved in the years to come. So, learning the underlying concepts is the most
important thing you can do. However, this last chapter will serve as a walk-through of the
implementation of genetic analysis as I have set it up at the website impute.me.
The reason I have not coupled the main text more tightly with the website is that
it would quickly be outdated. Additionally, I will not pretend that impute.me is the
only good analysis engine out there. However, as we will see, each of the chapters
is in fact linked to a conceptual module in the impute.me website.


A.2. The GWAS Module 


The GWAS module is all about genetic risk scores as covered in Chapter 6, i.e. traits
that are governed by many SNPs. It takes advantage of the fact that all major GWAS
have been catalogued and stored in public repositories. With knowledge of effect
sizes and effect directions it is possible to calculate gene scores. This approach will
show your level of something, e.g. a disease risk, based on the known genetics
(Section 6.3.1). This means that all traits that have been investigated with the GWAS
methodology can also be calculated for yourself. The score will be a so-called
Z-score, i.e. the number of standard deviations your gene score lies from the
population average, which can also be given as how many percent of a population
have a lower gene score than you. Importantly, such a score is not the same as e.g.
a lifetime disease risk.
To make such a translation it is also necessary to know how common a disease is
and how much of it is explained by genetics. It is likely that such information will be
catalogued in the coming years, but particularly the latter is not yet systematically
available. All you can currently deduce from your score is how large the known
genetic component to a certain disease risk is, compared to the average for your
ethnicity. The remaining pieces of information (the unknown genetic component
and the environment, respectively) are not included. They correspond to
the light blue and middle blue boxes in Fig. 2.3. To get a better understanding of
the limitations of the calculation, it is recommended to study this figure, as well
as Section 6.6.
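As a rough illustration of the calculation described above, the following Python sketch sums effect sizes into a gene score, converts it to a Z-score, and then to the percentage of a population with a lower score. It is a minimal sketch, not the impute.me implementation: the SNP names, effect sizes, population mean, and standard deviation are all made-up assumptions, and the percentile step assumes that gene scores are normally distributed in the population.

```python
import math

def gene_score(genotypes, effects):
    """Sum of effect-allele counts (0, 1 or 2) weighted by GWAS effect size."""
    return sum(effects[snp] * count
               for snp, count in genotypes.items() if snp in effects)

def z_score(score, population_mean, population_sd):
    """Number of standard deviations the score lies from the population mean."""
    return (score - population_mean) / population_sd

def percent_below(z):
    """Percent of a normally distributed population with a lower score."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical effect sizes (negative = protective) and effect-allele counts.
effects = {"rs_a": 0.10, "rs_b": -0.05, "rs_c": 0.20}
my_genotypes = {"rs_a": 2, "rs_b": 1, "rs_c": 0}

score = gene_score(my_genotypes, effects)
# The population mean and SD are assumed to be known for the user's ethnicity.
z = z_score(score, population_mean=0.05, population_sd=0.10)
print(f"Z-score {z:.2f}: about {percent_below(z):.0f}% have a lower gene score")
```

Note that nothing in this calculation knows how common the disease is or how much of it is explained by genetics, which is why the result cannot be read as a lifetime risk.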

The GWAS module has a sister module called UK Biobank. It works off a large
population study performed in the United Kingdom (UK). This study is currently
the world's largest population genetic study, having approximately ½ million
participants. All participants were subjected to a detailed survey that was compared
with their genetic profile as well as some of their health registry information. In this
manner, it is possible to see how your genetic profile compares to the participants
of the UK Biobank study. Will your genome be more like those with self-reported
risk-taking behaviour? Or those without?

Like in the GWAS module, the score is not directly translatable to your overall
risk of being a risk taker. Rather, the correct interpretation is that these scores
capture the known genetic component of a trait. How much this explains whether
you really are a risk taker depends on many other things in your environment
and upbringing, which cannot be determined merely from a genetic test.


A.3. The Rare Diseases Module 


The rare diseases module re-creates the results of the 23andme health interface
before the FDA warning letter of 2013. These results were selected because they
all had relatively high effect sizes for rather serious diseases, many of which were
treatable if discovered. I liked them because they were chosen for a reason; they
were seen as the most useful disease-informing SNPs at the time. This usefulness
has not disappeared. Many of them are therefore also found in the current carrier
status reports (2018).

Generally, when approaching single-SNP analysis the most important
recommendation is to be very aware of effect sizes. The SNPs in this module are therefore
strictly the ones with high effect, as selected by the geneticists at 23andme. You
can find information about more single-SNP effects using e.g. SNPedia, but when
moving beyond the list in this module it is important to be aware of the multigenic
nature of most traits. Interpreting one SNP at a time simply doesn’t make sense
for many traits. This is what multi-SNP gene scores are for.

In this module, users uploading data from 23andme arrays will find that most
of the SNPs are available to them. Others, however, will often find that many are
missing. That's because the 23andme arrays were specifically designed for these
diseases, with proprietary SNP names like i4000440 and i3002759. The names are
not secret though; these two are called rs138058578 and rs76992529, for example.
This means that with imputation it is often possible to determine their genotype for
users with other microarray types, regardless of whether it is measured or not. Often,
but not always. This is the reason why non-23andme users may often find warnings
for missing data.
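The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pairing of the two proprietary names with the two rs-names is assumed to follow the order in which they are listed in the text, the function name `find_genotypes` is hypothetical, and the genotypes in the example are made up.

```python
# Assumption: the pairing follows the order in which the text lists the names.
PROPRIETARY_TO_RSID = {
    "i4000440": "rs138058578",
    "i3002759": "rs76992529",
}

def find_genotypes(module_snps, measured, imputed):
    """Look up each module SNP, first as measured, then as imputed.

    Returns the genotypes that were found (with their source) and a list
    of SNPs that should trigger a missing-data warning.
    """
    found, missing = {}, []
    for snp in module_snps:
        rsid = PROPRIETARY_TO_RSID.get(snp, snp)
        if snp in measured:
            found[snp] = (measured[snp], "measured")
        elif rsid in measured:
            found[snp] = (measured[rsid], "measured")
        elif rsid in imputed:
            found[snp] = (imputed[rsid], "imputed")
        else:
            missing.append(snp)
    return found, missing

# Made-up example: a user of another microarray type, where one SNP
# could be imputed and the other could not.
measured = {"rs0000001": "AG"}
imputed = {"rs138058578": "CC"}
found, missing = find_genotypes(["i4000440", "i3002759"], measured, imputed)
print(found)    # genotypes recovered through imputation
print(missing)  # SNPs that lead to a warning for missing data
```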


A.4. BRCA and Mutation Sensor Modules 


The basis of these two modules is the idea that imputation will allow determination
of a slightly broader view of rare genetic variation. This is particularly of interest
with the BRCA genes, because the overarching problem with BRCA genes and
consumer genetics is that of the microarray: much is measured, but it is not a
systematic measurement of every single nucleotide in these genes.

In addition, some mutations are perfectly benign to carry; they
cause no risk, regardless of whether you have one variant or another. This is the
reason for the inclusion of information on the predicted pathogenicity level.
These are the levels given as so-called ClinVar, SIFT and PolyPhen scores, all
of which represent systematic attempts at classifying all SNPs by how dangerous
they really are.

These two things constitute the challenge: We can’t see everything from a 
microarray, and of the BRCA SNPs we see, many are not important. 

The challenge, therefore, is to interpret which findings are important. The three
23andme-selected BRCA variants i4000377, i4000378, and i4000379 are the most
important ones to look at first. They were selected for inclusion on the microarray
because they are both dangerous and less rare than other BRCA mutations. In 2018,
23andme received FDA approval for reporting these as cancer-causing.

Availability of additional mutations depends a lot on the data source that is
uploaded. In general, I have been slightly disappointed with the ability to expand
the view of BRCA mutations, and it is worth repeating that only BRCA gene
sequencing can accurately determine the presence of all potential BRCA gene mutations.

Similarly, for the mutation sensor module: the idea here was to look more
broadly for genes that are completely broken, which we know they sometimes
are, even in healthy people. The module will detect some of these genes, if you
have them, but it has limited medical utility simply because we don’t know the
consequence of many of these mutations. Yet.


A.5. The Appearance Module 


The appearance module investigates height and hair colour. Both studies are
referenced in the main text of the book, and the gene scores are calculated in
the same way as in the GWAS modules. The interesting thing in this module is that
it simultaneously allows collection of “self-reported phenotype” from any user
who wishes to contribute. This is highly illustrative for height, where there
is a correlation between user-reported height and genetically calculated height.
Note that literature examples using more controlled data have found even higher
correlations (e.g. Wood et al., 2014; Lello et al., 2017). Such data is the basis of
the background “cloud” of likely distributions.

If you wish to read off a likely adult height, e.g. for a child, simply read off the
Y-axis value at the point where the vertical gene score line passes the main part
of the background cloud. Because this cloud is based on hundreds of thousands
of measurements of both genetics and actual height, it represents the most likely
values for the adult height for a person with that given genetic profile.
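The read-off described above can be mimicked numerically. The Python sketch below is a simplified illustration, not how the website draws its plot: the function name `likely_height`, the window width, and the small background cloud of (gene score, height in cm) pairs are all made-up assumptions. It takes the people whose gene score is close to yours and reports their median height.

```python
from statistics import median

def likely_height(cloud, my_score, window=0.25):
    """Median height of people whose gene score is within `window` of mine.

    `cloud` is a list of (gene_score, height_cm) pairs. Returns None if
    nobody in the cloud has a gene score close enough to read off a value.
    """
    nearby = [height for score, height in cloud
              if abs(score - my_score) <= window]
    if not nearby:
        return None
    return median(nearby)

# Made-up background cloud; the real one is far larger.
cloud = [(-1.0, 165), (-0.9, 168), (0.0, 172), (0.1, 175),
         (0.2, 174), (1.0, 180), (1.1, 183)]

print(likely_height(cloud, my_score=0.1))
```

With a real cloud, the spread of the nearby heights is just as informative as the median, because it shows how wide the range of likely adult heights is for a given gene score.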

In the same module, the self-reported hair colour works in a similar manner.
Hair colour, however, is a more diffuse and non-constant trait throughout life, so
less accuracy is achieved. Ultimately, however, the idea is to break down hair colour
into blonde-to-black scores and red-to-not-red scores, and then calculate a most
likely colour within those two dimensions.

Another module, the politics module, was made in the same style as the height
module. Whereas height is an easily measured, almost constant trait with a large
heritable component, political opinion is almost the polar opposite, genetically
at least. For this reason, I wanted to test and illustrate how well it actually held up.
The largest GWAS at the time (Hatemi et al., 2014) reported a set of political
opinion-associated SNPs, sufficient to build a gene score. In all fairness, that study
did not claim to find any SNPs that were strongly associated with political opinion.
But they did report a SNP list nonetheless.

Pairing this gene score with user self-reporting of political opinion
provided an opportunity to illustrate how much connection there was between
genetics and politics. Unsurprisingly, the correlation was virtually non-existent. Even
users with the most extreme gene scores were virtually indistinguishable in their
self-reported political opinions. In my opinion, this provides a nice illustration of
the limitations of heritability.


A.6. The Ethnicity Module 


The ethnicity module was described in Chapter 4, and illustrated in 
Fig. 4.3. The module works simply by performing a principal components analysis 
of the ~2000 SNPs that are most informative for ethnicity, when compared to the 
Thousand Genomes Project. The purpose of the module is simply to extract an 
accurate, but broadly defined ethnic background, e.g. “Asian” or “European”. 
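To make the principal components idea concrete, here is a small Python sketch using NumPy. It is a toy version under several assumptions: the reference panel has four SNPs rather than ~2000, the genotypes and the population labels "A" and "B" are made up, and assigning the nearest population centroid is one simple way to turn the principal components into a broad label, not necessarily the rule used by the module.

```python
import numpy as np

def fit_pca(reference, n_components=2):
    """Principal axes of a reference panel (individuals x SNPs, counts 0/1/2)."""
    mean = reference.mean(axis=0)
    centered = reference - mean
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(genotypes, mean, axes):
    """Coordinates of one or more individuals in principal component space."""
    return (genotypes - mean) @ axes.T

def nearest_population(point, ref_points, labels):
    """Label of the reference population whose centroid is closest."""
    labels = np.array(labels)
    centroids = {lab: ref_points[labels == lab].mean(axis=0)
                 for lab in sorted(set(labels.tolist()))}
    return min(centroids, key=lambda lab: np.linalg.norm(point - centroids[lab]))

# Made-up reference panel: six individuals, four ancestry-informative SNPs.
reference = np.array([[2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 1],
                      [0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 1, 2]], dtype=float)
labels = ["A", "A", "A", "B", "B", "B"]

mean, axes = fit_pca(reference)
ref_points = project(reference, mean, axes)
me = project(np.array([2, 2, 0, 1], dtype=float), mean, axes)
print(nearest_population(me, ref_points, labels))
```

The same projection can be reused for every new user, since the principal axes are computed once from the reference panel.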

The advantage of this approach is that it is very robust towards the isolated 
goal of defining a broad ethnic background. This is important for the overall site, 
because broad ethnic background is factored into the gene score calculations 
in e.g. the GWAS modules. Another advantage is that it is based on a project 
consisting of individuals whose ancestry has been evaluated by an anthropologist 
as part of a large consortium effort. One drawback is that the breadth of human 
ethnicity is wider than what is covered in the Thousand Genomes Project, so some 
ethnicities simply are not covered. Another drawback is that there is no division 
by chromosome. So, it is not possible to do the “genome painting”, which is what 
tells you more specifically which of your chromosomal parts originate from where. 
This can be of interest if you want to know what areas of the world different parts 
of your genome came from. 

All in all, my experience is that the ethnicity module is robust and accurate, 
but for some well investigated ethnicities, e.g. European, you probably will find 
finer resolutions at other analytical sites. Meanwhile, it is also my impression that 
precision is somewhat oversold in the field of ancestry genetics, as is discussed 
in the main text. 


A.7. Health Comparison Module 


Genetics is a field in fast development, and without doubt there will be many
additional developments in the future. I believe, for example, that we will start
to use a much tighter connection between gene scores and current health state.
An overall genetic disease risk is often of little value to a healthy person, as
already discussed in Section 7.2. However, it may be much more useful to know the
genetic disease risk in a situation where you are entering hospital to be evaluated
for that disease or a similar disease.
Imagine, for example, that you are being evaluated for symptoms that your
doctor suspects could be a form of gastrointestinal cancer. At precisely that
moment it is useful for you to know your gene score for diseases that show similar
symptoms, but are different. For example, if you have a very high genetic risk for
Crohn's disease, that information could be highly useful to the diagnosing clinician.

In the current working version of the health comparison module, this is 
presented as a map of all known diseases linked to gene scores that are known 
for you. The idea is that easier browsing will also enhance the usefulness of gene 
scores that are otherwise not useful outside of a specific clinical context. 


Index 


23andme, 26, 27, 38, 45, 47, 74 
ACTN3 sprint-SNPs, 26 

Allele, 9, 60 

Alpha-1 antitrypsin deficiency, 46 
Alzheimer's, 46-48, 67, 73 
Ancestry, 29 

Ancestry.com, 5, 26, 27, 38 
Animal, 11 

Autoimmune disease, 15 
Autosomal, 35, 43 


Blood pressure, 67 
BRCA, 51, 72 
Breast cancer, 51, 71 


Cardiovascular disease, 15, 49 
Carrier, 43, 46 
Chromosomes, 9 
Codegen.eu, 47 
Coffee, 64 
Coronary heart disease. See Cardiovascular disease
Cystic Fibrosis, 41 


Data generation. See Microarray 
Depression, 16 

Deterministic. See Precision 
DNAGedcom, 38 

DNA.land, 25, 38


DNA sequencing, 23, 43 
Dominance, 4, 10, 43, 44, 51 


Early-onset primary dystonia, 46 
Effect allele, 60, 62 
Environment, 15 


Factor XI deficiency, 46 

Familial Hypercholesterolemia, 41 
Family study, 13, 41 

Family Tree DNA, 5 

FDA, 45 


24genetics, 47 

Gaucher disease type 1, 46 
Genealogy, 39 

Geneplaza, 47 

Genes for Good, 5 

Genetic scores, 60 
Genome, 7 

Genotype, 9 


Glucose-6-phosphate dehydrogenase deficiency, 46
Gregor Mendel, 57 
GRS. See Genetic scores 
GWAS, 14, 55 


Haemophilia, 41 
Hair colour, 3 




Half-identical, 36 

Height, 16, 55, 63 

Helix.com, 47 

Hereditary hemochromatosis, 46 
Hereditary thrombophilia, 46 
Heterozygous, 11 

Homozygous, 11 

Huntington's disease, 41 


Imputation, 24 
Impute.me, 25, 47 
Incompletely penetrant, 51 
ISOGG, 47 


Linkage disequilibrium, 24 
Linkage studies. See Family study 


Mental disorders, 15 

Microarray, 12, 22, 42, 52 

Minor allele, 62 

Missing heritability, 15 

Mitochondria, 35 

Multi-genic risk scores. See Genetic scores

Mutation, 42 

My46.org, 45

MyHeritage, 5, 27, 38 


Niemann Pick Disease, 41 
Odds ratio, 48 


Palaeogenetics, 32 
Parkinson's, 46 


Paternity, 2, 25 

PCR, 25 

Personalized medicine. See Precision medicine

Phenylketonuria, 41 

Polygenic risk scores. See Genetic scores

Precision, 26, 37, 52, 65, 71 

Precision medicine, 69, 72 

Prediction. See Precision 

Principal components analysis, 34 

Privacy, 75 

Promethease. See SNPedia 


Raw data, 21, 23 
Recessive, 10, 41, 43 
Relatives, 36 

Risk allele, 62 


Schizophrenia, 15 
Sex-linked, 35, 43, 44 
Smoking, 48, 50 

SNP, 7 

SNPedia, 46, 47 
Standard deviation, 61 
Stroke, 48 


Tay Sachs, 41, 55 
Thousand Genomes Project, 7, 24, 33 
Twins, 15, 36 


Y-chromosome, 35 


Z-score. See Standard deviation 


